PRINTED FROM the OXFORD RESEARCH ENCYCLOPEDIA, ECONOMICS AND FINANCE. (c) Oxford University Press USA, 2020. All Rights Reserved. Personal use only; commercial use is strictly prohibited (for details see Privacy Policy and Legal Notice).

date: 25 January 2020

Health Status Measurement

Summary and Keywords

Health status measurement issues arise across a wide spectrum of applications in empirical health economics research as well as in public policy, clinical, and regulatory contexts. It is fitting that economists and other researchers working in these domains devote scientific attention to the measurement of those phenomena most central to their investigations. While often accepted and used uncritically, the particular measures of health status used in empirical investigations can have sometimes subtle but nonetheless important implications for research findings and policy action. How health is characterized and measured at the individual level and how such individual-level measures are summarized to characterize the health of groups and populations are entwined considerations. Such measurement issues have become increasingly salient given the wealth of health data available from population surveys, administrative sources, and clinical records in which researchers may be confronted with competing options for how they go about characterizing and measuring health. While recent work in health economics has seen significant advances in the econometric methods used to estimate and interpret quantities like treatment effects, the literature has seen less focus on some of the central measurement issues necessarily involved in such exercises. As such, increased attention ought to be devoted to measuring and understanding health status concepts that are relevant to decision makers’ objectives as opposed to those that are merely statistically convenient.

Keywords: health status, time use, multiple endpoints, aggregation, population heterogeneity, health economics


This article concerns the measurement of health. Specifically, it addresses several themes arising when health status measurement is a prominent feature of empirical health economics and related investigations, the results of which may be used to inform policy decisions in health and regulatory economics contexts. Measurement matters.

To suggest to an empirical investigator working in financial economics, labor economics, agricultural economics, etc., the importance of high-quality, decision-relevant measurement of the concepts of interest (prices, quantities, environments, etc.) would seem a settled matter. But in health economics, how often are measurement issues involving health status taken with comparable seriousness? The hope is “always”; the reality is “sometimes.”

Why is measurement of health in economic analysis important? At least two reasons are immediately apparent. The first is intrinsic interest for scientific, policy, regulatory, etc., purposes in which the focus is on understanding health (“h”) as an outcome arising from treatments or other factors, $h = h(x,u)$, or as a predictor of other outcomes of interest, $y = y(h,x,u)$. The second reason is an instrumental one: for scientific, policy, regulatory, etc., purposes, health outcomes play a central role in evaluation (CEA, CBA, etc.) criteria $v(h,y,\ldots)$, for example, an incremental cost-effectiveness ratio $(c_1 - c_0)/(h_1 - h_0)$, a net health benefit criterion $\mathrm{NHB} = h - (c/\lambda)$ (see Stinnett & Mullahy, 1998), a benefit-cost test $b(h) - c(h)$, etc. Either way, thoughtful conceptualization and measurement of what is meant by “health” should lead to better understanding of and decision-making regarding the questions at hand.
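To make the two evaluation criteria concrete, the following is a minimal sketch rather than a definitive implementation. It assumes that h and c in the net health benefit expression are read as incremental health and incremental cost, and that λ is a willingness-to-pay threshold per unit of health. The function names and all numbers are hypothetical.

```python
def icer(c1, c0, h1, h0):
    """Incremental cost-effectiveness ratio: (c1 - c0) / (h1 - h0)."""
    delta_h = h1 - h0
    if delta_h == 0:
        raise ValueError("ICER is undefined when h1 equals h0")
    return (c1 - c0) / delta_h


def net_health_benefit(delta_h, delta_c, lam):
    """Net health benefit: incremental health minus incremental cost / lambda."""
    return delta_h - delta_c / lam


# Hypothetical inputs (illustrative only, not taken from the article)
c0, c1 = 10_000.0, 16_000.0   # costs under the comparator and the new treatment
h0, h1 = 5.0, 5.25            # health outcomes in some common unit
lam = 50_000.0                # assumed willingness to pay per unit of health

ratio = icer(c1, c0, h1, h0)                      # 24000.0
nhb = net_health_benefit(h1 - h0, c1 - c0, lam)   # about 0.13
```

With a positive health gain, the net health benefit is positive exactly when the cost-effectiveness ratio falls below λ. Both criteria depend entirely on how h is measured, which is the point of the surrounding discussion.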

“Measurement” can mean many different things: What does it mean to measure “health” at an individual level, to describe an individual at a point in time by one or more numbers? How does one usefully summarize or aggregate multiple measures of individual health in a manner that respects the integrity and properties of the multiple measures? When the population of interest is one whose individuals’ health status is heterogeneous, how might one summarize the heterogeneous health outcomes in that population to arrive at a single, scalar measure of that population’s “health”? Some of the fundamental considerations that arise in measurement science (see Stevens, 1946, for a classic treatment) are clearly useful to bear in mind when addressing each of these topics, but such considerations are not in the purview of this article.

It is important to note at this juncture that this article’s treatment is not exhaustive. While the discussion raises some generic issues and concerns, the discussion here spans only a small fraction of health economists’ and others’ efforts in these domains. See Manning, Newhouse, and Ware (1982) for a definitive treatment of several fundamental health-measurement issues relevant in economic analysis and McDowell and Newell (1996) for a book-length treatment of a variety of technical issues involved in survey-based health status measurement. Many important health measurement topics not covered or covered only in passing here include the use of surrogate health measures (e.g., Institute of Medicine, 2012), health-related quality-of-life metrics (e.g., Mullahy, 2001), measurement error (e.g., Bollinger & Chandra, 2005), and the importance of social and environmental contexts in measuring health (e.g., Hausman, 2015).

The purpose here is to explore the importance of health status measurement in three areas of concern in applied health economics hinted at in the previous paragraph. All three themes involve how individual-level health data are measured or summarized in or across different dimensions: health measurement in time, health measurement over multiple outcomes, and health measurement across heterogeneous populations.

Health Measurement in Time

Many discussions of health status in health economics correctly begin with Grossman’s (1972) model of the demand for health. This work is best known for its emphasis on health as a valued commodity that cannot be purchased on markets, for its development of the notion of investment in health capital stocks as a special form of human capital, and for its careful assessment of the roles of medical care and schooling in the demand-investment framework.

A less emphasized feature of Grossman’s framework is its careful distinction between health capital and health status. Health capital, a stock, is at time “i” generated via some investment-depreciation process $H_{i+1} - H_i = I_i - \delta_i H_i$, with investment “production” given by $I_i = I_i(M_i, TH_i; E_i)$, where M represents medical care, TH is time devoted to health capital investment (including time spent engaged in medical care), and E is schooling or some other feature of non-health human capital. Health status, conversely, is a flow that in Grossman’s framework is at any time “i” proportional to health capital stocks, $h_i = \phi_i H_i$ (analogous to how income flows, e.g., derive from wealth stocks).

As a flow, health status is necessarily defined by some time frame or time denomination. In Grossman’s framework, this translates into considering poor health or sickness as a particular form of time use within individuals’ overall time budgets:

The time constraint requires that $\Omega$, the total amount of time available in any period, must be exhausted by all possible uses: $TW_i + TL_i + TH_i + T_i = \Omega$, where $TL_i$ is time lost from market and nonmarket activities due to illness or injury. . . .

If sick time were not added to market and nonmarket time, total time would not be exhausted by all possible uses. My model assumes that $TL_i$ is inversely related to the stock of health; that is, $\partial TL_i / \partial H_i < 0$. If $\Omega$ were measured in days ($\Omega = 365$ days if the year is the relevant period) and if $\phi_i$ were defined as the flow of healthy days per unit of $H_i$, $h_i$ would equal the total number of healthy days in a given year. Then one could write $TL_i = \Omega - h_i$. (Grossman, 1972, p. 227)
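The stock-flow distinction can be illustrated with a short simulation. This is a sketch under stated assumptions: investment is supplied as given numbers rather than produced from medical care, time, and schooling; the flow coefficient is held constant over time; and healthy days are capped at the total time available. The function name and all parameter values are hypothetical.

```python
def simulate_health(H0, investment, delta, phi, omega=365.0):
    """Trace health capital (stock) and healthy days (flow) over periods.

    Stock:  H[i+1] - H[i] = I[i] - delta[i] * H[i]
    Flow:   h[i] = phi * H[i]   (capped at omega; the cap is an assumption)
    Sick:   TL[i] = omega - h[i]
    """
    H = H0
    path = []
    for I_i, d_i in zip(investment, delta):
        healthy = min(phi * H, omega)
        path.append({"H": H, "healthy_days": healthy, "sick_days": omega - healthy})
        H = H + I_i - d_i * H
    return path


# Hypothetical values: constant investment, rising depreciation
path = simulate_health(
    H0=100.0,
    investment=[5.0, 5.0, 5.0],
    delta=[0.05, 0.10, 0.15],
    phi=3.0,
)
for period, row in enumerate(path):
    print(period, row)
```

In this example the stock is unchanged after the first period, because investment exactly offsets depreciation, and falls after the second. Sick days rise only once the stock has fallen, which shows why a flow measure needs a stated time frame.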

It may or may not be important to draw a distinction between health capital and healthy time (i.e., “health status”) in a particular application. However, some measures of health elicited via familiar survey questions will often be difficult to interpret because of the lack of correspondence to either health capital or healthy time. For instance, compare a survey question: “In general, would you say your health is EVGFP [excellent, very good, good, fair, or poor]?” with an alternative: “Over the past week, would you say your health is EVGFP?” The former version provides no time-frame anchor within which a respondent can interpret unambiguously what is meant by “in general”; as such it is challenging to interpret a response to such a question as a measure of healthy time or a health flow. Manning et al. (1982) provide compelling arguments for the importance of time-denomination in health status measurement (see Mullahy, 2016, for additional discussion).

The notion that healthiness is a flow corresponding to some form of time use or time allocation within a given time budget or time frame is appealing both conceptually, in the context of Grossman’s model, and intuitively. Familiar self-reported or self-rated health status (SRHS) measures in many standard population surveys are perhaps the most prominent examples of such issues. As noted above, the survey question “In general, would you say your health is EVGFP?” commonly encountered in health and social surveys lacks a particular time anchor. Other frequently encountered SRHS measures, however, do specify time windows within which respondents should evaluate their health.

The notion of defining healthiness via time allocation within a predetermined time window also appears in the context of disease-specific clinical outcomes wherein patients’ allocations of time are fundamentally defined by their health status. For instance, in the context of a randomized trial comparing deep-brain stimulation and medication against medical management for patients suffering from Parkinson’s symptoms (Deuschl et al., 2006), the main outcome studied was a multidimensional time allocation measure. The dimensions of clinical interest are (1) mobile with troublesome dyskinesias, (2) mobile without troublesome dyskinesias, (3) neither fully mobile nor fully immobile, (4) immobile, and (5) sleeping. The first and fourth measures are considered negative outcomes, the second and fifth positive outcomes, and the third an ambiguous outcome. The measurement strategy used by Deuschl et al. (2006) guarantees that these time uses add up to 24 hours in one day. Analogous measurement strategies are used in the context of evaluating the effectiveness of interventions for cardiovascular and cerebrovascular diseases. For example, O’Brien et al. (2016) examine 90-day and one-year post-discharge variations in time at home (alive out of a hospital, inpatient rehabilitation facility, or skilled nursing facility) for acute ischemic stroke patients.

Health Measurement With Multiple Outcomes

Circumstances where multiple measures of health and health-related outcomes are simultaneously of interest arise regularly in health economics. The manner in which multiple measures of health are presented to analysts differs greatly across contexts. Yet regardless of context, since many conceptualizations of health status recognize its multifaceted or multidimensional character, it is natural to consider whether such multidimensionality is important to address in health economics studies and, if so, how to do so.

Some important examples of multidimensional health status concern diagnostic or screening scales or scores that have been derived on the basis of M specific survey or screening “items” (in some contexts termed “multiple indicators”) or multiple outcome or process indicators of patient-level healthcare quality from which inferences concerning overall patient-level care quality are to be drawn. Many other such circumstances can be readily imagined within health economics, and others having conceptually analogous structures arise broadly in other fields beyond health economics (e.g., poverty measurement, food security, portfolio composition). Regardless of the particular analytical or decision-making context, common issues arise when a researcher must accommodate simultaneously multiple measures of health in a particular analytical setting. This is true regardless of whether the multiple health measures are “predicting” (i.e., “on the right-hand side”) or “being predicted” (i.e., “on the left-hand side”). At an abstract level one might consider an M-dimensional health measure describing individual j’s health status, $h = h_j = [h_{1j}, \ldots, h_{Mj}]$. At this juncture the only relevant feature to be noted about the population joint distribution $h \sim F(h)$ is that the elements of h are unlikely to be mutually independent. Whether such dependence arises from some common-factor structure or for other more general biological reasons, the dependence across the elements of h will typically be a relevant consideration. While the previous section considered a special case of multiple health outcomes (time fractions), the discussion in this section concerns a broader class of multiple health-outcome scenarios.

Multiple Health Measures as Predictors

The challenge of understanding multiple health outcomes as RHS variables or “predictors” is a familiar one arising in many contexts in health economics. In these circumstances, a general consideration is estimation of an econometric model like y=g(h,x,u). Such a model would be familiar in settings involving risk adjustment, risk equalization, utilization modeling, and others.

A prominent concern in such exercises is how to specify the role of the M-dimensional measure of health for reasons of conceptual and statistical parsimony, statistical power, predictive performance, avoidance of overfitting, etc. For concreteness, consider a specification y=g(ϕ(h),x,u) in which ϕ(h) specifies how the M elements of h are included in a particular econometric model. ϕ(h) may be dimension-expanding, for example, where ϕ(h) describes main and second-order or interaction effects in and across the elements of h. Conversely, ϕ(h) may be dimension-reducing, for example, where ϕ(h) is some aggregator function mapping h into a scalar or some lower-dimensioned summary of the elements of h. In other cases ϕ(h) may be simultaneously dimension-reducing and dimension-expanding (e.g., aggregation of detailed health status indicators to a lower-dimensioned vector, then inclusion of main and higher-order effects of these aggregated metrics). See Brilleman et al. (2014), Pope et al. (2004), and Wherry, Burns, and Leininger (2014) for discussions of various considerations arising in such settings.

Ultimately the key issue with respect to the econometrician’s specification of ϕ(h) —that is, with respect to the measurement of health status in a particular analytical context—is how well it squares with the true model. Suppose the true model is y=g(ψ(h;γ),x,u) but the econometrician specifies y=g(ϕ(h;α),x,u). When ψ(h) is nested within ϕ(h), that is, ϕ(h)=[ψ(h),ω(h)], then standard econometric considerations of inclusion of irrelevant RHS variables will generally be applicable; in many such instances specification using ϕ(h) will introduce variance but may not introduce bias with respect to the estimates of the elements of γ.

Different considerations arise when ϕ(h) is a dimension-reducing aggregation of h as will often be the case in applied work. For example, a common approach is to use as a RHS covariate a scalar “score” obtained as the sum or weighted sum of binary or ordered health status measures, $S_j = \sum_{m=1}^{M} w_m h_{mj}$, and then to specify ϕ(h) as $\phi(h) = S\alpha$, where α is scalar. Alternatively, one might define a binary scalar “diagnosis” predictor based on S, for example, $D_j = 1(S_j \geq d)$, and specify accordingly $\phi(h) = D\alpha$ (see Meigs et al., 2008, for an application in the context of genotype risk scores). How estimates of the scalar α relate to the true vector of parameters γ is challenging to assess in general. In any event, the ultimate concern is the extent to which the econometrician’s measure of health status squares with the measure(s) of health status that are truly predictive of the outcomes y.
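A small sketch may help show what a dimension-reducing aggregator discards. It assumes a weak-inequality cutoff for the diagnosis, and the item values, weights, and cutoff are hypothetical.

```python
def weighted_score(h, w):
    """Scalar score: the weighted sum of the M health items."""
    if len(h) != len(w):
        raise ValueError("h and w must have the same length")
    return sum(w_m * h_m for w_m, h_m in zip(w, h))


def diagnosis(score, d):
    """Binary diagnosis: 1 if the score reaches the cutoff d (assumed weak inequality)."""
    return 1 if score >= d else 0


w = [1, 1, 1, 1]            # unit weights (hypothetical)
person_a = [1, 1, 0, 0]     # positive on items 1 and 2
person_b = [0, 0, 1, 1]     # positive on items 3 and 4
person_c = [1, 0, 1, 1]

s_a, s_b, s_c = (weighted_score(p, w) for p in (person_a, person_b, person_c))
d_a, d_b, d_c = (diagnosis(s, d=3) for s in (s_a, s_b, s_c))
```

Persons a and b have different item patterns but identical scores and diagnoses. If the items differ in how strongly they predict y, a model that uses only the score or the diagnosis cannot distinguish these two individuals.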

Multiple Health Measures as Outcomes

While circumstances arise frequently in which considerations of how to treat multiple health status measures as outcomes are prominent, the contexts may be quite varied. Such concerns range from ones involving multiple comparisons and familywise errors (e.g., Romano & Wolf, 2005), to others involving composite endpoints (e.g., Montori et al., 2005; U.S. FDA, 2017), to yet others involving more generally the specification of health status patterns or aggregates to be explained.

Despite the variety of these considerations, many such issues can be addressed in the context of an overarching statistical framework. Conditional on x, suppose h follows some population joint distribution $h \sim F(h|x)$. F(h|x), of course, embodies all relevant information about the jointness properties of the elements of h, given x. When presented with an empirical problem involving F(h|x), analysts can pursue at least three fundamentally different strategies.

First, they may model separately the M marginal distributions, $F(h_m|x)$, or functionals thereof, $V(F(h_m|x))$ (e.g., moments, quantiles), $m = 1, \ldots, M$, and then explore one-at-a-time properties of each of the M separate estimates. Notwithstanding multiple comparisons considerations in formulating inferences based on such approaches, there is nothing problematic per se in pursuing such a strategy so long as it addresses the decision-maker’s questions at hand. It is not uncommon, for instance, to see in a published health economics article a table of regression results in which the names of M health measures define column headings and the names of K regressors define the row headings.

A second approach is to respect the jointness properties of F(h|x), model the full joint conditional distribution of the M health status measures, and focus directly on the resulting estimates as the objects of interest. For instance, suppose all M health status measures $h_m$ are binary so that there are $2^M$ possible patterns $k_p$ of the vector h. In this second approach, the objects of interest might be: the $2^M$ joint probabilities themselves, $\Pr(h = k_p|x)$, $p = 1, \ldots, 2^M$; marginal effects thereof, $\partial \Pr(h = k_p|x)/\partial x$; or other quantities related thereto. Fundamentally it is the patterns of h outcomes per se that are of interest in pursuing this approach.
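The pattern-based view can be sketched by tabulating sample shares for every possible pattern of M binary measures. The sketch below ignores covariates and uses a tiny hypothetical sample; it illustrates the objects of interest and is not an estimator of a conditional model.

```python
from collections import Counter
from itertools import product


def pattern_shares(sample, M):
    """Sample share of each of the 2**M possible patterns of M binary measures."""
    counts = Counter(tuple(row) for row in sample)
    n = len(sample)
    return {k: counts.get(k, 0) / n for k in product((0, 1), repeat=M)}


# Hypothetical sample of M = 2 binary health measures for 8 individuals
sample = [(0, 0), (0, 0), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1), (0, 1)]
shares = pattern_shares(sample, M=2)

# Marginal shares, for comparison with the joint share of the (1, 1) pattern
p1 = sum(row[0] for row in sample) / len(sample)
p2 = sum(row[1] for row in sample) / len(sample)
joint_11 = shares[(1, 1)]
product_of_marginals = p1 * p2
```

Here the joint share of the (1, 1) pattern exceeds the product of the two marginal shares, so the two measures are positively dependent in this sample. Modeling the marginals one at a time would miss that dependence.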

A third strategy is to appeal to some dimension-reducing aggregation of h, A(h), as the LHS health status measure to be explained, with scalar or univariate aggregates of greatest interest. In many instances in applied work, the aggregates A(h) have the structure of some form of counting measure. Moreover, variants of this strategy encompass several aggregation strategies that are in common use (and sometimes of great debate) in applied research. For instance, suppose the M health status measures $h_m$ are binary or ordered, $h_m \in \{0, \ldots, (C_m - 1)\}$, where $C_m = 2$ if $h_m$ is binary. Letting $\Omega = \sum_{m=1}^{M} (C_m - 1)$, define the (unweighted) aggregate count or sum $S = \sum_{m=1}^{M} h_m$, so that $S \in \{0, \ldots, \Omega\}$ in general and $S \in \{0, \ldots, M\}$ when all $h_m$ are binary. Health status measures like S or weighted variants thereof arise as outcome measures across a wide variety of health-related settings: Apgar scores for neonatal health assessment (AAP, 2006); PHQ-2 scores for depression screening (Fleishman et al., 2014); widespread pain scores for pain intensity assessment (Hunt et al., 1999); multiple chronic conditions measures in population health contexts (AHRQ, 2014); and many others.

Beyond the counts S per se, health status measures related to S arise frequently in health research. For instance, it is often the case that what might be termed “diagnoses” based on S are of central interest. For a given S and a given set of d thresholds, $\tau = \{t_0, \ldots, t_{d-1}\}$, define the diagnosis measure D as:


That is, the diagnosis measure $D_\tau$ is a coarsened (perhaps binary) representation of S. Such health status “diagnoses” arise across a broad span of clinical and population health contexts. In the examples noted above, Apgar scores less than 7 and less than 4 on the 10-point ($M = 5$, $C_m = C = 3$) scale are suggestive of moderate and high levels of neonate distress, while PHQ-2 scores above 2 on the 6-point ($M = 2$, $C_m = C = 4$) scale are suggestive of a depression diagnosis. In another context, the DSM-5 uses an $M = 11$ battery of items, with $d = 3$ for substance use disorder diagnoses: none, mild, moderate, and severe. It should be noted that in the simplest case with a single threshold $t_0$, the resulting diagnosis measure corresponds to the “dual cutoff” metric suggested by Alkire and Foster (2011) in the context of multiple-deprivation poverty measures.
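One way to picture a threshold-based diagnosis is as the number of thresholds that the count S reaches. The sketch below is an assumption about the form of the coarsening, not the article's own definition: it treats thresholds as crossed weakly (S at or above a threshold counts) and as ordered from lowest to highest. The three cutoffs used for the four-level example are illustrative values chosen here.

```python
def count_score(h):
    """Unweighted count S: the sum of the M item scores."""
    return sum(h)


def diagnosis_level(S, thresholds):
    """Coarsened diagnosis: the number of thresholds that S reaches.

    Assumes ascending thresholds and weak inequality, so d thresholds
    yield d + 1 ordered categories labelled 0, ..., d.
    """
    return sum(1 for t in thresholds if S >= t)


# Four-level example with d = 3 illustrative cutoffs on an 11-item count
labels = ["none", "mild", "moderate", "severe"]
cutoffs = (2, 4, 6)   # hypothetical values for illustration
items = [1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0]   # hypothetical responses, M = 11
S = count_score(items)
level = labels[diagnosis_level(S, cutoffs)]

# Binary example: a two-item ordered scale where scores above 2 are flagged
flag_low = diagnosis_level(2, (3,))
flag_high = diagnosis_level(3, (3,))
```

With a single threshold the measure is binary. Scales on which lower scores signal worse health, such as Apgar, would need the inequality reversed.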

In practice, there arise several important special types of diagnoses. If all $h_m$ are binary, define $D_{\mathrm{ANY}} = 1(S > 0)$ and $D_{\mathrm{ALL}} = 1(S = M)$. $D_{\mathrm{ANY}}$ corresponds to a standard definition of a composite (or “or” or “union”) health status measure in which the binary diagnosis of health status is positive when any of the M component measures is positive. Composite measures are widely used in practice. For instance, the composite outcome measure specified in Look AHEAD Research Group (2013) is representative of the kinds of composite measures one encounters in the clinical literature: death from cardiovascular causes, nonfatal myocardial infarction, nonfatal stroke, or hospitalization for angina. Yet the use of composite health status measures is a topic of some debate in areas like clinical evaluation (e.g., Montori et al., 2005). Some of this debate is the result of a lack of conceptual clarity (terms like “lump” have been used to characterize endpoints thus measured), while other features of the debate arise from concerns about drawing valid inferences when such outcomes are analyzed empirically.

$D_{\mathrm{ALL}}$, conversely, corresponds to a standard definition of an “all-or-none” (or “and” or “intersection”) health status measure in which the binary health status diagnosis is positive only when each of the M component measures is positive. For instance, all-or-none standards have been advocated as stringent quality criteria in the clinical quality arena (Nolan & Berwick, 2006) and have been adopted in a variety of healthcare settings (e.g., the diabetes subdomain of the Medicare ACO quality metrics). In some instances in the clinical literature, the $D_{\mathrm{ALL}}$ measure is referred to as a co-primary endpoint.
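The composite and all-or-none measures are simple functions of the count S, as the following sketch shows. The component vectors are hypothetical.

```python
def d_any(h):
    """Composite ("or", "union") measure: 1 if any binary component is positive."""
    return int(sum(h) > 0)


def d_all(h):
    """All-or-none ("and", "intersection") measure: 1 only if every component is positive."""
    return int(sum(h) == len(h))


# Hypothetical component vectors with M = 4 binary measures
patients = {
    "none positive": [0, 0, 0, 0],
    "one positive": [0, 1, 0, 0],
    "all positive": [1, 1, 1, 1],
}
results = {name: (d_any(h), d_all(h)) for name, h in patients.items()}
```

The two measures disagree for every pattern that is neither all zeros nor all ones, so the choice between them determines how the large middle ground of partial outcomes is classified.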

Regardless of the particular definition of $D_\tau$, it is important to note that it inherits all its stochastic properties from that of S, which in turn are inherited from those of h and x. To this end, assume the binary/ordered $h_m$ are coarsened (threshold-crossed) versions of continuous latent variables $h_m^*$, with $h^* = [h_1^*, \ldots, h_M^*] \sim F^*(h^*|x)$ conditional on x. In applications it might often be assumed that $F^*(\ldots)$ is multivariate normal, $\mathrm{MVN}(xB, R)$, although other distributional assumptions are certainly possible. The observed, coarsened versions of the $h_m^*$ are distributed $h = [h_1, \ldots, h_M] \sim F(h|x)$, and from this joint distribution are inherited the stochastic structures of $F_S(S|x)$ and $F_D(D_\tau|x)$. Given these stochastic structures, one may then proceed to estimation of and inference about functionals of $F_S(S|x)$ and/or $F_D(D|x)$ that may be of interest for decision-making, for instance, $E[S|x]$, $\Pr(S \geq s|x)$, or $\Pr(D \geq d|x)$.
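A simulation sketch can show how the distribution of S is inherited from the latent structure. It assumes a special case of the multivariate normal model: unit variances and a common correlation rho across all pairs of latent variables, generated through a single common factor, with binary items defined by one shared cutoff. The function name, parameter values, and seed are hypothetical.

```python
import random


def simulate_counts(xb, rho, cut, n, seed=0):
    """Draw n values of S from threshold-crossed equicorrelated normal latents.

    Latent: h*_m = xb[m] + sqrt(rho) * f + sqrt(1 - rho) * e_m,
    with f and e_m independent standard normal (so corr = rho).
    Observed: h_m = 1 if h*_m > cut, else 0;  S = sum of h_m.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        f = rng.gauss(0.0, 1.0)   # common factor shared by all items
        S = 0
        for mean_m in xb:
            h_star = mean_m + rho ** 0.5 * f + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
            S += 1 if h_star > cut else 0
        draws.append(S)
    return draws


M, n = 3, 20000
independent = simulate_counts([0.0] * M, rho=0.0, cut=0.0, n=n, seed=1)
dependent = simulate_counts([0.0] * M, rho=0.8, cut=0.0, n=n, seed=1)

mean_ind = sum(independent) / n
mean_dep = sum(dependent) / n
p_all_ind = sum(s == M for s in independent) / n   # share with all items positive
p_all_dep = sum(s == M for s in dependent) / n
```

Under these assumptions the mean of S is about the same in both cases, since it depends only on the marginal probabilities. The probability that all items are positive is much larger when the latent variables are strongly correlated, which is why the joint distribution matters for diagnosis-type outcomes.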

Health Measurement Across Heterogeneous Populations

Heterogeneity in population attributes remains a fact of life, so how one elects to summarize features of a heterogeneous population certainly matters in practice and may matter for the conclusions and inferences one draws regarding how such outcomes are determined. This section assesses several prominent analytical, conceptual, and practical issues arising when health status data characterizing individuals in heterogeneous populations must be summarized for purposes of decision-making (Mullahy, 2001, 2017). Statistical summarization of health status data is such a routine exercise in applied health economics that often little thought is accorded to the implications of electing to summarize heterogeneous data in one way versus another. Yet it is demonstrated by example that the manner in which such data are summarized can (and, indeed, should) affect real-world decision-making.

Such themes are first-order considerations in the econometric treatment-effect literature, for instance (see Imbens & Wooldridge, 2009), and of necessity become relevant in a broad spectrum of regulatory (FDA, EPA, OSHA, etc., in the United States) and other public policy domains (Manski, 2013). While devising and implementing so-called “data summaries” may not be a foremost consideration for many analysts and decision-makers, applications cannot proceed without explicitly or (often the case) implicitly making choices, informed ideally by reference to decision-makers’ value or loss functions, about which summary or summaries should be made. That is, the decision to use a particular data summary to characterize outcomes in a heterogeneous population distribution is tantamount to specifying a particular statistical loss or decision function whose expected value is minimized by that summary (Manski, 2007). Well-known cases are means minimizing quadratic loss, medians minimizing absolute loss, and quantiles minimizing asymmetric absolute loss.
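The link between loss functions and data summaries can be checked numerically. The sketch below searches a grid of candidate summaries for the one that minimizes average loss over a small, skewed, hypothetical sample. The grid search stands in for a proper optimizer, and the data and grid spacing are assumptions made for illustration.

```python
import statistics


def average_loss(data, v, loss):
    """Average loss from summarizing every observation by the single value v."""
    return sum(loss(h - v) for h in data) / len(data)


def quadratic(e):
    return e * e


def absolute(e):
    return abs(e)


def asymmetric_absolute(q):
    """Asymmetric absolute ("check") loss for quantile level q."""
    return lambda e: q * e if e >= 0 else (q - 1.0) * e


data = [0, 1, 1, 2, 3, 5, 20]             # hypothetical, right-skewed sample
grid = [i / 10 for i in range(0, 201)]    # candidate summaries 0.0, 0.1, ..., 20.0

best_quadratic = min(grid, key=lambda v: average_loss(data, v, quadratic))
best_absolute = min(grid, key=lambda v: average_loss(data, v, absolute))
best_q90 = min(grid, key=lambda v: average_loss(data, v, asymmetric_absolute(0.9)))

sample_mean = statistics.mean(data)       # about 4.57
sample_median = statistics.median(data)   # 2
```

The quadratic-loss minimizer lands on the grid point nearest the sample mean, the absolute-loss minimizer on the median, and the asymmetric loss with q = 0.9 on the largest observation. Three defensible loss functions thus give three very different one-number descriptions of the same sample.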

“Averages are the meat of statisticians, where ‘average’ here is understood in the wide sense of any summary statement about a large population of objects” (Efron, 1978; italics added). The concern here is about summary statements or “data summaries” in the sense described by Efron. Means, probabilities, quantiles, etc., including conditional and/or joint versions thereof, are the kinds of health-status data summaries of interest here. The particular context of concern is suggested by considering a univariate health status outcome measure h that is determined by h=h(x,u), where x summarizes exogenous covariates, treatments, etc. Population heterogeneity in u will in general induce population heterogeneity in h, resulting in some population distribution of health status F(h|x). When one hears or reads in an advertisement for a prescription drug, for instance, that “individual results may vary,” that is the idea of main concern here.

In the sense used here, a data summary is a statistic or a functional on the distribution F(h|x), V(F(h|x)), like a moment, percentile, or probability. Such summaries are, of course, the main tools of economists’ empirical trade. Much of the field’s empirical work involves estimation of particular functionals (e.g., V(F(h|x))=E[h|x] via least-squares regression), estimation of treatment effects via comparison of functionals of two (possibly counterfactual) distributions—for example, contrasting V(F1(h)) vs. V(F0(h)) in an evaluation exercise—as well as many other possibilities. No such exercise is possible, of course, without the specification of V(...); whether such specification is explicit or is instead implied from specification of a particular estimand—for example, a mean implied from minimizing quadratic loss in some least-squares regression—is a separate matter.

It should be noted that not all interesting evaluation exercises involve contrasts between functionals of marginal distributions, $V(F_1(h))$ vs. $V(F_0(h))$. Stochastic dominance comparisons are one obvious exception, and distributions of treatment effects defined on the basis of possibly non-identified joint distributions of multiple outcomes feature prominently in the treatment-effect literature (see Imbens & Wooldridge, 2009). For example, Mullahy (2017), using results from Fan and Park (2010), considers derivation of Fréchet-Boole bounds on “inequality probabilities” arising from outcomes’ joint distributions. The inequality probability, $\Pr(h_1 > h_0)$, provides an intuitive characterization of the extent to which one outcome is stochastically larger than another, which can be appreciated from its definition, $\Pr(h_1 > h_0) = \int_{-\infty}^{\infty} \int_{h_0}^{\infty} f(h_0, h_1)\, dh_1\, dh_0$, wherein $f(\ldots)$ is the joint probability density. $\Pr(h_1 > h_0)$ is sometimes referred to as “fraction who benefit.”
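Because the inequality probability depends on the joint distribution, the same pair of marginal distributions can produce different values of it. The sketch below computes the fraction who benefit under three illustrative ways of pairing two small hypothetical samples. These three couplings are examples only and are not claimed to attain the sharp bounds discussed in the literature.

```python
from itertools import product


def share_better(pairs):
    """Share of (h0, h1) pairs in which h1 strictly exceeds h0."""
    pairs = list(pairs)
    return sum(1 for a, b in pairs if b > a) / len(pairs)


# Hypothetical marginal samples of the two potential outcomes
h0 = [1, 2, 3, 4]
h1 = [0, 3, 4, 5]

# Three couplings that all preserve the two marginals
same_rank = share_better(zip(sorted(h0), sorted(h1)))                     # ranks preserved
reversed_rank = share_better(zip(sorted(h0), sorted(h1, reverse=True)))   # ranks reversed
independent = share_better(product(h0, h1))                               # all combinations
```

The marginals, and hence the difference in means, are identical across the three cases. The fraction who benefit still ranges from one half to three quarters, so it cannot be recovered from the marginal distributions alone.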

Such joint-distribution considerations notwithstanding, suppose that in an evaluation comparing $F(h|x_0)$ with $F(h|x_1)$, an analyst finds that $\mathrm{median}(F(h|x_1)) > \mathrm{median}(F(h|x_0))$, $E[h|x_0] > E[h|x_1]$, $\mathrm{Prob}(h > h_c|x_0) > \mathrm{Prob}(h > h_c|x_1)$, etc. What should one conclude then as to whether $x_0$ or $x_1$ is “better”? Perhaps the only case where the choice of functional will not matter for adjudicating “better” is that of zero-order stochastic dominance (Castagnoli, 1984). Apart from this special case, the choice of V(...) may matter for decision-making. While much emphasis is (properly) placed on defining and measuring relevant individual-level measures of h and on obtaining estimates of the $F(h|x_j)$ having desirable statistical properties, thoughtful choice of V(...) reflecting whatever decision-maker loss functions may be in play ought not be ignored as a fundamental part of the exercise.
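A pair of small hypothetical samples is enough to reproduce this kind of disagreement among summaries. The values and the cutoff below are invented for illustration.

```python
import statistics


def exceedance(sample, cutoff):
    """Share of the sample with health outcome strictly above the cutoff."""
    return sum(1 for h in sample if h > cutoff) / len(sample)


# Hypothetical outcome samples under two conditions
h_x0 = [1, 1, 1, 1, 21]   # mostly low outcomes with one very large value
h_x1 = [3, 3, 3, 3, 3]    # uniformly moderate outcomes
h_c = 10                  # hypothetical cutoff

summaries = {
    "median": (statistics.median(h_x0), statistics.median(h_x1)),
    "mean": (statistics.mean(h_x0), statistics.mean(h_x1)),
    "exceedance": (exceedance(h_x0, h_c), exceedance(h_x1, h_c)),
}
```

The median favors the second condition, while the mean and the exceedance probability favor the first. Which condition counts as better therefore depends on the summary, and by extension on the loss function the decision-maker has in mind.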

Health Data Summaries in Policy and Regulatory Settings

As a thought experiment, consider answering this question: What is the temperature of Earth? At any point in time, the answer to such a question figures prominently in policy discussions regarding global warming. Yet there is clearly no single temperature that fully characterizes the entirety of the planet. Instead, the temperature data measured at over 1,200 globally dispersed climate-monitoring stations are (after seasonal and other adjustments) summarized into a single metric that is proposed to represent our entire planet’s temperature. Why is this particular data summary chosen to define “Earth’s temperature”?

For many policy, regulatory, or other decision-making purposes closer to the realm of health economics, one must summarize health status across heterogeneous populations. Such exercises are prominent in a variety of “treatment” evaluation contexts. For instance, a sponsor of some new, purportedly health-enhancing, medical technology may wish to claim something like:




What is the statistical basis of such a claim? How are the health status data advanced in support of such a claim (e.g., from a randomized trial) being summarized as evidence for that claim? Why do the sponsor’s data analysts summarize their data in this particular manner?

It is suggested here that while the method of data summary (the choice of functional) used in any particular empirical exercise may in some cases be essentially ad hoc, in other instances there are regulatory structures in place that may dictate to some lesser or greater degree the manner in which analysts choose to summarize their data. If choice of the mode of health status data summarization were the only implication of such regulatory structures, then the question might be little more than academic in nature. But when the results of analyses conducted using particular data summaries in turn have implications for downstream resource allocation or policymaking decisions, the importance of data summary methods is raised to a potentially significantly higher level.

Whether for devices, biologics, or drugs, wherefrom do such criteria or standards arise? A sponsor’s purported effects will typically not be ad hoc but will have some ostensibly reasonable rationale. In essence, technologies’ sponsors do not typically formulate their own independent metrics of what constitutes “effectiveness” but instead rely on some combination of criteria that may be suggested by FDA or other relevant regulatory authorities. The data summaries or statistical functionals that arise correspondingly will, in general, not be selected on the basis of minimization of some decision-maker’s expected loss but will instead be governed significantly by criteria advanced by FDA, which often have at best the flavor of having a biostatistical rationale; in other instances, the origins of such criteria appear to be frighteningly ad hoc.

Considerations of and decisions regarding how heterogeneous individual health status (“effectiveness”) data are to be summarized in RCTs or other settings may arise at various stages: clinical trial registration and reporting in ClinicalTrials.gov (Zarin et al., 2011); pre-IND and pre-NDA consultations between sponsors and FDA staff (during which, among other things, “appropriate methods for statistical analysis of the data” may be discussed); specification of recommended outcome or endpoint measures as detailed in FDA Guidance documents; the design and conduct of Phase 2 and Phase 3 trials; integrated summaries of effectiveness (ISEs) presented by sponsors as a component of the evidence of effectiveness; deliberations of FDA Advisory Committees; and others. At any of these stages, considerations and decisions may arise regarding how individual-level data on (say) trial subjects are to be not only measured but also summarized, an essential though often underappreciated feature of health-related evaluations.


High-quality, high-impact empirical research in health economics requires taking seriously the measurement of health. This article has offered perspectives on how such measurement issues arise and might be addressed in three contexts of interest in some health economics research. At a general level health measurement issues are far from resolved, and it is hoped that considerable thought and effort continue to be invested in this important agenda.


Thanks are owed to participants at the 2016 HESG Winter Meetings in Manchester and the 2016 Will Manning Memorial Conference at the University of Chicago for helpful comments and discussions. Financial and logistical support for some aspects of the work reported here has been provided by an Evidence for Action grant 73336 from the Robert Wood Johnson Foundation, by the Health & Society Scholars Program at UW–Madison, and by NICHD Grant P2C HD047873 to the UW–Madison Center for Demography and Ecology.


Agency for Healthcare Research and Quality (AHRQ). (2014). Multiple chronic conditions chartbook. Rockville, MD: AHRQ.

Alkire, S., & Foster, J. (2011). Counting and multidimensional poverty measurement. Journal of Public Economics, 95, 476–487.

American Academy of Pediatrics, Committee on Fetus and Newborn; American College of Obstetricians and Gynecologists and Committee of Obstetric Practice. (2006). The Apgar Score. Pediatrics, 117, 1444–1447.

Bollinger, C. R., & Chandra, A. (2005). Iatrogenic specification error: A cautionary tale of cleaning data. Journal of Labor Economics, 23, 235–257.

Brilleman, S. L., Gravelle, H., Hollinghurst, S., Purdy, S., Salisbury, C., & Windmeijer, F. (2014). Keep it simple? Predicting primary health care costs with clinical morbidity measures. Journal of Health Economics, 35, 109–122.

Burns, M., & Mullahy, J. (2016). Healthy-time measures of health outcomes and healthcare quality. NBER Working Paper 22562.

Bynum, J. P. W., Meara, E., Chang, C.-H., & Rhoads, J. M. (2016). Our parents, ourselves: Health care for an aging population. Lebanon, NH: The Dartmouth Institute.

Castagnoli, E. (1984). Some remarks on stochastic dominance. Rivista di Matematica per le Scienze Economiche e Sociali, 7, 15–28.

Centers for Medicare and Medicaid Services. (2015). The CMS equity plan for improving quality in Medicare. Baltimore: CMS, Office of Minority Health.

Cravens, S. M. R. (2002). The usage and meaning of “clinical significance” in drug-related litigation. Washington and Lee Law Review, 59, 553–597.

Deuschl, G., Schade-Brittinger, C., Krack, P., Volkmann, J., Schäfer, H., Bötzel, K., . . . Voges, J. (2006). A randomized trial of deep-brain stimulation for Parkinson’s disease. New England Journal of Medicine, 355, 896–908.

Edwards, S. T., & Landon, B. E. (2014). Medicare’s chronic care management payment—payment reform for primary care. New England Journal of Medicine, 371, 2049–2051.

Efron, B. (1978). Controversies in the foundations of statistics. American Mathematical Monthly, 85, 231–246.

EuroQol Group. (2009). EQ-5D-5L health questionnaire—English version for the U.K. Rotterdam: EuroQol Group.

Fan, Y., & Park, S. S. (2010). Sharp bounds on the distribution of treatment effects and their statistical inference. Econometric Theory, 26, 931–951.

Fleishman, J. A., Zuvekas, S. H., & Pinkus, H. A. (2014). Screening for depression using the PHQ-2: Changes over time in conjunction with mental health treatment. AHRQ Working Paper 14002. Rockville, MD: AHRQ.

Grossman, M. (1972). On the concept of health capital and the demand for health. Journal of Political Economy, 80, 223–255.

Hausman, D. (2015). Valuing health—well-being, freedom, and suffering. Oxford: Oxford University Press.

Hunt, I. M., Silman, A. J., Benjamin, S., McBeth, J., & Macfarlane, G. J. (1999). The prevalence and associated features of chronic widespread pain in the community using the “Manchester” definition of chronic widespread pain. Rheumatology, 38, 275–279.

Imbens, G. W., & Wooldridge, J. M. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47, 5–86.

Institute of Medicine. (2010). Evaluation of biomarkers and surrogate endpoints in chronic disease. Washington, DC: National Academies Press.

Institute of Medicine. (2012). Living well with chronic illness: A call for public health action. Washington, DC: National Academies Press.

Look AHEAD Research Group. (2013). Cardiovascular effects of intensive lifestyle intervention in type 2 diabetes. New England Journal of Medicine, 369, 145–154.

Lynn, J., McKethan, A., & Jha, A. K. (2015). Value-based payments require valuing what matters to patients. Journal of the American Medical Association, 314, 1445–1446.

Machlin, S. R., & Soni, A. (2013, April 25). Health care expenditures for adults with multiple treated chronic conditions: Estimates from the Medical Expenditure Panel Survey, 2009. Preventing Chronic Disease, 10.

Manning, W. G., Newhouse, J. P., & Ware, J. E., Jr. (1982). The status of health in demand estimation; or, beyond excellent, good, fair, and poor. In V. Fuchs (Ed.), Economic aspects of health (pp. 143–184). Chicago: University of Chicago Press for NBER.

Manski, C. F. (2007). Identification for prediction and decision. Cambridge, MA: Harvard University Press.

Manski, C. F. (2013). Public policy in an uncertain world: Analysis and decisions. Cambridge, MA: Harvard University Press.

Mattke, S., Higgins, A., & Brook, R. (2015). Results from a national survey on chronic care management by health plans. American Journal of Managed Care, 21, 370–376.

McDowell, I., & Newell, C. (1996). Measuring health: A guide to rating scales and questionnaires. Oxford: Oxford University Press.

Medicare Payment Advisory Commission (MedPAC). (2015a). Improving care for Medicare beneficiaries with chronic conditions. Testimony of MedPAC Executive Director Mark E. Miller, Senate Finance Committee, May 14, 2015. Washington, DC: MedPAC.

Medicare Payment Advisory Commission (MedPAC). (2015b). Next steps in measuring quality of care in Medicare. Chapter 8 in June 2015 report to the Congress: Medicare and the health care delivery system. Washington, DC: MedPAC.

Meigs, J. B., Shrader, P., Sullivan, L. M., McAteer, J. B., Fox, C. S., Dupuis, J., . . . Cupples, L. A. (2008). Genotype score in addition to common risk factors for prediction of type 2 diabetes. New England Journal of Medicine, 359, 2208–2219.

Montori, V. M., Permanyer-Miralda, G., Ferreira-González, I., Busse, J. W., Pacheco-Huergo, V., Bryant, D., . . . Guyatt, G. H. (2005). Validity of composite end points in clinical trials. British Medical Journal, 330, 594–596.

Mullahy, J. (2001). Live long, live well: Quantifying the health of heterogeneous populations. Health Economics, 10, 429–440.

Mullahy, J. (2016). Time and health status in health economics. Health Economics, 25, 1351–1354.

Mullahy, J. (2017). Individual results may vary: Elementary analytics of inequality-probability bounds, with applications to health-outcome treatment effects. Working paper, University of Wisconsin-Madison.

Nolan, T., & Berwick, D. M. (2006). All-or-none measurement raises the bar on performance. Journal of the American Medical Association, 295, 1168–1170.

O’Brien, E. C., Xian, Y., Xu, H., Wu, J., Saver, J. L., Smith, E. E., . . . Fonarow, G. C. (2016). Hospital variation in home-time after acute ischemic stroke. Stroke, 47, 2627–2633.

Pope, G. C., Kautter, J., Ellis, R. P., Ash, A. S., Ayanian, J. Z., Iezzoni, L. I., . . . Robst, J. (2004). Risk adjustment of Medicare capitation payments using the CMS-HCC model. Health Care Financing Review, 25, 119–141.

Porter, M. E., Larsson, S., & Lee, T. H. (2016). Standardizing patient outcome measures. New England Journal of Medicine, 374, 504–506.

Romano, J., & Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73, 1237–1282.

Rosenzweig, M. R., & Schultz, T. P. (1983). Estimating a household production function: Heterogeneity, the demand for health inputs, and their effects on birth weight. Journal of Political Economy, 91, 723–746.

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.

Stinnett, A. A., & Mullahy, J. (1998). Net health benefits: A new framework for the analysis of uncertainty in cost-effectiveness analysis. Medical Decision Making, 18, S68–S80.

Tinetti, M. E., & Studenski, S. A. (2011). Comparative effectiveness research and patients with multiple chronic conditions. New England Journal of Medicine, 364, 2478–2481.

U.S. Department of Health and Human Services. (2010). Multiple chronic conditions: A strategic framework. Optimum health and quality of life for individuals with multiple chronic conditions. Washington, DC: DHHS.

U.S. Food and Drug Administration. (2009). Guidance for industry: Patient-reported outcome measures: Use in medical product development to support labeling claims. Rockville, MD: CDER, CBER, CDRH, Food and Drug Administration.

U.S. Food and Drug Administration. (2012). FDA executive summary memorandum. P100049, Linx Reflux Management System, Torax. Prepared for the January 11, 2012, Meeting of the Gastroenterology and Urology Devices Advisory Panel. Rockville, MD: Food and Drug Administration.

U.S. Food and Drug Administration. (2015). Guidance for industry and review staff: Best practices for communication between IND sponsors and FDA during drug development. Rockville, MD: CDER, CBER, Food and Drug Administration.

U.S. Food and Drug Administration. (2017). Multiple endpoints in clinical trials—Guidance for industry. Rockville, MD: USDHHS/FDA CBER/CDER.

Wherry, L. R., Burns, M. E., & Leininger, L. J. (2014). Using self-reported health measures to predict high-need cases among Medicaid-eligible adults. Health Services Research, 49, 2147–2172.

Zarin, D. A., Tse, T., Williams, R. J., Califf, R. M., & Ide, N. C. (2011). The ClinicalTrials.gov results database—Update and key issues. New England Journal of Medicine, 364, 852–860.