date: 09 December 2019

# Measuring Health Utility in Economics

## Summary and Keywords

Quality-adjusted life years (QALYs) are one of the main health outcome measures used to make health policy decisions. It is assumed that the objective of policymakers is to maximize QALYs. Since the QALY weights life years according to their health-related quality of life, it is necessary to calculate those weights (also called utilities) in order to estimate the number of QALYs produced by a medical treatment. The methodology most commonly used to estimate utilities is to present standard gamble (SG) or time trade-off (TTO) questions to a representative sample of the general population. It is assumed that, in this way, utilities reflect public preferences. Two different assumptions should hold for utilities to be a valid representation of public preferences. One is that the standard (linear) QALY model has to be a good model of how subjects value health. The second is that subjects should have consistent preferences over health states. Regarding the first, the evidence indicates that most of the main assumptions of the popular linear QALY model do not hold. This suggests that utilities elicited under the assumption that the linear QALY model holds may be biased; a modification of the linear model can be a tractable improvement. In addition, the second assumption, namely that subjects have consistent preferences that can be estimated by asking SG or TTO questions, does not seem to hold. Subjects are sensitive to features of the elicitation process (like the order of questions or the type of task) that should not matter for estimating utilities. The evidence suggests that the questions (TTO, SG) that researchers ask members of the general population produce response patterns that are inconsistent with the assumption that subjects have well-defined preferences over health states. Two approaches can deal with this problem. One is based on the assumption that subjects have true but biased preferences.
True preferences can be recovered from biased ones. This approach is valid as long as the theory used to debias is correct. The second approach is based on the idea that preferences are imprecise. In practice, national bodies use utilities elicited using TTO or SG under the assumptions that the linear QALY model is a good enough representation of public preferences and that subjects’ responses to preference elicitation methods are coherent.

# Introduction: The Standard QALY Model

## Quality-Adjusted Life Years as a Measure of Individuals’ Preferences

Using economic evaluation to inform health resource allocation decisions requires health outcomes to be measured and valued along with costs. The outcome of any health technology (a surgical procedure, a pharmaceutical therapy, an organizational program, a diagnostic technique, etc.) may be described in terms of an extension of life expectancy and/or an improvement in quality of life. Hence, measures of health outcomes should ideally capture both dimensions of healthcare benefits.

The quality-adjusted life year (QALY) is a metric originally proposed in the field of epidemiology by Fanshel and Bush (1970) with the name “Health Status Index,” which combines length of life and quality of life (i.e., mortality and morbidity) into a single measure. A QALY is defined as the value of living one year in full health. Therefore, living one year in a health state ($q_i$) distinct from full health is valued less than one QALY. When QALYs are calculated, years of life are weighted according to the quality of life (or, more precisely, the health-related quality of life, HRQOL) in which those years are lived. Given an individual’s health profile, described as a sequence of periods of time $(t_1,t_2,\ldots,t_n)$, each of them lived in a certain health state $(q_1,q_2,\ldots,q_n)$, the value attached to that profile (i.e., the number of QALYs) would be

$V(q_1,t_1;\ldots;q_n,t_n)=\sum_{i=1}^{n}v(q_i)\,t_i$
(1)

where $v(q_i)$ is the HRQOL weight or the “utility” corresponding to state $q_i$, on a 0 to 1 scale, with 1 representing full health and 0 representing death; values of $v(q)$ below 0 are allowed when health states are regarded as worse than being dead.
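The calculation in Equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not an implementation taken from the literature: the function name `qalys` and the example profile are assumptions made here for exposition.

```python
from typing import Sequence, Tuple


def qalys(profile: Sequence[Tuple[float, float]]) -> float:
    """Sum v(q_i) * t_i over (utility, years) pairs, as in Equation (1).

    Utilities may be negative (states worse than death) but cannot exceed 1.
    """
    total = 0.0
    for utility, years in profile:
        if years < 0:
            raise ValueError("durations must be non-negative")
        if utility > 1:
            raise ValueError("utilities cannot exceed 1 (full health)")
        total += utility * years
    return total


# Hypothetical profile: 5 years at utility 0.8 followed by 3 years at utility 0.5.
example_qalys = qalys([(0.8, 5), (0.5, 3)])
```

Under these illustrative numbers the profile is worth 4 + 1.5 = 5.5 QALYs.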

Individual preferences on health outcomes must fulfill certain conditions to be properly represented under Equation (1). First, there are three conditions that guarantee that individual preferences over a chronic health profile (i.e., when the health state remains constant over time) may be represented by $V(q,t)=v(q)t$. These three conditions are, according to Pliskin, Shepard, and Weinstein (1980), mutual utility independence, constant proportional trade-off, and risk neutrality.

Mutual utility independence relates to the separability of $q$ and $t$ in the valuation of a health profile. The value assigned to $t$ years in state $q$, $V(q,t)$, can be decomposed into $v(q)$ and $s(t)$. This implies that the utility of a certain health state, $v(q)$, will be the same, regardless of the time an individual lives in that health state. For this reason, the decomposition $V(q,t)=v(q)s(t)$ is referred to as the multiplicative QALY model (Abellán-Perpiñán, Pinto-Prades, Méndez, & Badía, 2006).

Constant proportional trade-off relates to the shape of the function $s(t)$. The QALY model shown in Equation (1) assumes that $s(t)=t$ and because of that it is known as the linear QALY model. However, the constant proportional trade-off condition only requires that $s(t)=t^{r}$, with $r>0$. This implies that, for any duration, the relative value of two health states will remain constant.
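A short Python sketch may help to see why a power function of time keeps relative values constant. The function names and all numerical values below are illustrative assumptions, not estimates from the studies cited.

```python
def power_qaly_value(v_q: float, years: float, r: float = 1.0) -> float:
    """V(q, t) = v(q) * t**r; r = 1 reproduces the linear QALY model."""
    if years < 0 or r <= 0:
        raise ValueError("need years >= 0 and r > 0")
    return v_q * years ** r


def relative_value(v_q1: float, v_q2: float, years: float, r: float) -> float:
    """Ratio of the values of two chronic states lived for the same duration."""
    return power_qaly_value(v_q1, years, r) / power_qaly_value(v_q2, years, r)


# Hypothetical states with v = 0.6 and v = 0.8, and curvature r = 0.7.
ratio_short = relative_value(0.6, 0.8, years=5, r=0.7)
ratio_long = relative_value(0.6, 0.8, years=20, r=0.7)
```

Because the term `years ** r` cancels in the ratio, `ratio_short` and `ratio_long` coincide (both equal 0.6/0.8) whatever the value of `r`.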

Risk neutrality implies that preferences are symmetric on time; that is, subjects are indifferent with respect to the period of time in which they experience a certain health state, because every year in a lifetime is worth the same. This condition requires that $r=1$ in

$V(q,t)=v(q)\,t^{r}$

Two additional conditions must be met to ensure that the additive model in Equation (1) is a good representation of individual preferences over nonchronic health profiles: additive independence and symmetry.

Basically, additive independence means that the utility function is additive, which implies that the value of a health profile, consisting of a sequence of time periods in different health states, is simply the sum of the values of every time period (Fishburn, 1965; Keeney & Raiffa, 1976).

The condition of symmetry requires that the valuation of a health outcome $(q,t)$ is the same regardless of the precise moment of time when that outcome is suffered or enjoyed (Abellán-Perpiñán & Pinto-Prades, 2000; Bleichrodt & Quiggin, 1997). Therefore, this condition rules out the possibility of adaptation to living in a poor health state (Groot, 2000) or the “maximum endurable time” phenomenon, that is, health states becoming unbearable when they last for a long period of time (Sutherland, Llewellyn-Thomas, Boyd, & Till, 1982).

Additive independence and symmetry may be replaced by an alternative condition proposed by Wakker (1996), namely, additive decomposability over disjoint time periods. This assumption means that a nonchronic health profile can be seen as a sequence of chronic profiles, and its valuation equals the sum of the values of each of the chronic components:

$V(q_1,t_1;\ldots;q_n,t_n)=\sum_{i=1}^{n}V(q_i,t_i)$
(2)

## Empirical Evidence on the QALY Model’s Assumptions

There is mounting evidence about the validity of the assumptions listed in the previous section (for a summary, see Abellán-Perpinán, Herrero, & Pinto-Prades, 2016). Regarding mutual utility independence, most of the empirical studies conclude that this condition holds for most of the participants (Abellán-Perpinán, Bleichrodt, & Pinto-Prades, 2009; Bleichrodt & Johannesson, 1997; Bleichrodt & Pinto-Prades, 2005; Bleichrodt, Pinto-Prades, & Abellán, 2003; Miyamoto & Eraker, 1988). On the contrary, constant proportional trade-off is rejected in some of the empirical studies aimed to test that assumption. According to Attema and Brouwer (2010), the evidence suggests that with shorter durations relative values tend to be higher. Risk neutrality does not hold in various studies with patients (McNeil, Weichselbaum, & Pauker, 1981; Verhoef, De Haan, & Van Daal, 1994).

The condition of additive decomposability over disjoint time, suggested by Wakker (1996) as an alternative to the assumptions of additive independence and symmetry, may be tested by comparing the valuation of a health profile with the weighted sum of the valuations of each of its components (time periods with different health conditions). This method is known as the “path state approach” and has been used in various studies showing conflicting results. Richardson, Hall, and Salkeld (1996) found evidence against the assumption of additive decomposability, whereas the results in Brazier, Dolan, Karampela, and Towers (2006) and MacKeigan, O’Brien, and Oh (1999) support its validity. Some studies (Abellán-Perpinán et al., 2006; Kupperman, Shiboski, Feeny, Elkin, & Washington, 1997) conclude that the valuation of a nonchronic profile can be derived from the valuations of chronic profiles at an aggregate level but not at the individual level.

In sum, the empirical evidence suggests that the basic assumption of linearity that underlies the standard QALY model may not properly reflect people’s preferences on health outcomes. This could be addressed by assuming an alternative functional form for $s(t)$, the utility function of life years (Abellán-Perpinán et al., 2009; Abellán-Perpinán et al., 2006; Miyamoto & Eraker, 1989). Another way of dealing with nonlinearity is to include discounting in the QALY model (Bleichrodt & Gafni, 1996), which means to assume that each life year may have a different weight in the valuation of a health profile.

However, separability between $v(q)$ and $s(t)$ in the process of evaluating $(q,t)$ still seems to be a plausible assumption on the face of the available evidence. This makes things easier, because measurement of health utilities—that is, valuing $v(q)$—may be addressed separately, and then one can combine these quality weights with the valuation of time in order to calculate the number of QALYs attached to a health profile.

# Methods for Measuring Health Utilities

## Utilities, Values, or Preferences?

The term “utility” is often used as a synonym for “preference.” However, in the literature on health state valuation, utilities are rooted in von Neumann and Morgenstern’s (1944) theory of decision-making under uncertainty. This is why some researchers (e.g., Drummond, Sculpher, Claxton, Stoddart, & Torrance, 2016) recommend that the term “utility” only be used when preferences for health states are measured with methods framed under uncertainty, such as the standard gamble (SG), which is explained later. In any other case, the most appropriate term would be “value.” For the sake of simplicity, this article refers to the outcomes of the different methods for measuring health state preferences, that is, the “quality weights” or HRQOL scores used to calculate QALYs, as “utilities,” “values,” or simply “preferences,” interchangeably. Hereafter these utilities are referred to as $U(q_i)$.

## Visual Analogue Scale

The simplest method to assign a cardinal value to health states consists of asking subjects to rate health states on a visual analogue scale (VAS) of 0 to 100, where 100 is “the best imaginable health state/condition” and 0 is “the worst imaginable health state/condition” (Torrance, 1986). Given that subjects may consider that certain health states are worse than being dead, the “death state” should also be rated on the scale. Since a value of zero is assigned to death in the health utilities scale from 0 to 1, VAS scores need to be transformed in the following way: $U(q_i)=[VAS(q_i)-VAS(\text{Death})]/[100-VAS(\text{Death})]$.
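The rescaling can be sketched as follows. The function name `rescale_vas` and the example ratings are assumptions introduced for illustration only.

```python
def rescale_vas(vas_state: float, vas_death: float) -> float:
    """Map a raw VAS rating (0-100) to the utility scale where death = 0 and 100 = 1."""
    if not 0 <= vas_state <= 100:
        raise ValueError("VAS ratings must lie between 0 and 100")
    if not 0 <= vas_death < 100:
        raise ValueError("the rating of death must lie in [0, 100)")
    return (vas_state - vas_death) / (100 - vas_death)


# Hypothetical respondent who rates death at 20 on the scale.
u_mild = rescale_vas(80, vas_death=20)    # state rated above death
u_severe = rescale_vas(5, vas_death=20)   # state rated below death
```

With death rated at 20, a state rated 80 maps to 0.75, while a state rated 5 maps to a negative utility (-0.1875), which signals a state regarded as worse than death.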

## Time Trade-Off

In the time trade-off (TTO) method (Torrance, 1976; Torrance, Thomas, & Sackett, 1972), subjects are asked to trade years of life in an impaired health state for improvements in quality of life. For health states preferred to death, respondents have to state how many years in full health are seen as equivalent to a certain number of years in an impaired health state $(q_i)$. Subjects are asked to assume that they will live in poor health for the rest of their lives ($t$ years), and they have to determine how many of those future years they are willing to forgo in exchange for living the remaining, fewer years in full health. Therefore, the purpose is to find the duration $x$ that makes individuals indifferent between living $t$ years in a poor health state and $x$ years in full health, that is:

$(q_i,t)\sim(FH,x)$

The utility of $q_i$ is calculated by simply dividing the number of years in full health by the life expectancy in poor health: $U(q_i)=x/t$. In Figure 1 this equivalence is shown by the equal areas of two rectangles: one with base $t$ and height $U(q_i)$, and the other with base $x$ and height 1.0.

Figure 1. Time trade-off method. Better-than-death chronic health states.

Alternative 1: State qi during t years $(qi,t)$. Alternative 2: Full health during x years $(FH,x)$.

When a health state is regarded as being worse than death, the framing of the TTO method changes. In this case, the scenario is the same as before, that is, a life expectancy of $t$ years in impaired health, but the question is different. Since subjects would rather die immediately than live in the impaired state, they have to state the number ($x$) of additional years of life in full health that would compensate for the burden of living $t$ years in the highly undesirable health condition. That is, the task is to find the duration $x$ that makes individuals indifferent between these two alternatives:

$\text{Immediate death}\sim(FH,x;\,q_i,t)$

The framing of TTO for worse-than-death states can be seen in Figure 2, and the utility of $q_i$ is obtained as $U(q_i)=-x/t$. Unlike “better than death” TTO utilities, which are bounded between 0 and 1, TTO utilities for “worse than death” states do not have a lower bound.

Figure 2. Time trade-off method. Worse-than-death chronic health states.

Alternative 1: Immediate death. Alternative 2: Full health during x years followed by state qi during t years $(FH,x;qi,t)$.
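The two TTO scoring rules can be put side by side in a small Python sketch. The function name, its `worse_than_death` flag, and the example responses are hypothetical and serve only to illustrate the formulas $U(q_i)=x/t$ and $U(q_i)=-x/t$.

```python
def tto_utility(x_years: float, t_years: float, worse_than_death: bool = False) -> float:
    """Convert a TTO indifference response into a utility.

    Better than death: (q, t) ~ (FH, x)           ->  U(q) =  x / t
    Worse than death:  death ~ (FH, x; q, t)      ->  U(q) = -x / t
    """
    if t_years <= 0 or x_years < 0:
        raise ValueError("need t > 0 and x >= 0")
    if worse_than_death:
        return -x_years / t_years
    if x_years > t_years:
        raise ValueError("x cannot exceed t for a state better than death")
    return x_years / t_years


# Hypothetical responses for a 10-year horizon in the impaired state.
u_better = tto_utility(7, 10)                           # 7 healthy years ~ 10 impaired years
u_worse = tto_utility(15, 10, worse_than_death=True)    # 15 healthy years needed as compensation
```

The second response yields a utility of -1.5, which illustrates why utilities for states worse than death have no lower bound under this framing.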

## Standard Gamble

The SG method (Torrance, 1976; Torrance et al., 1972) is rooted in expected utility theory (von Neumann & Morgenstern, 1944). In contrast to TTO, the SG method is framed in a risky context. Here subjects have to assume that they will be in a certain health state poorer than full health for the rest of their lives (i.e., a chronic health state) and they are offered a medical treatment that could restore their health but implies a certain risk of dying. Individuals have to state the maximum risk of death they are willing to accept in order to recover full health. The method is shown in Figure 3, and the utility of the impaired health state $(q_i)$ is simply the value of $p$ (the treatment’s chance of success) that makes the individual indifferent between both alternatives, that is, $U(q_i)=p$.

Figure 3. Standard gamble method. Better-than-death chronic health states.

Alternative 1: A gamble between full health, with probability p, and immediate death, with probability (1 – p). Alternative 2: State qi for the rest of life.

For states regarded as worse than death, the framing is quite similar, but the health state ($q_i$) becomes the worst outcome of the gamble and immediate death is the alternative to the lottery, as can be seen in Figure 4. The utility is calculated as $U(q_i)=-p/(1-p)$.

Figure 4. Standard gamble method. Worse-than-death chronic health states.

Alternative 1: A gamble between full health, with probability p, and state qi with probability (1 – p). Alternative 2: Immediate death.
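The SG scoring rules follow from setting the expected utility of the gamble equal to the utility of the certain alternative, with $U(\text{full health})=1$ and $U(\text{death})=0$. The sketch below is illustrative: the function name, the flag, and the example probabilities are assumptions.

```python
def sg_utility(p: float, worse_than_death: bool = False) -> float:
    """Convert an SG indifference probability into a utility.

    Better than death: q ~ (p, FH; 1 - p, death)  ->  U(q) = p
    Worse than death:  death ~ (p, FH; 1 - p, q)  ->  0 = p + (1 - p) * U(q)
                                                   ->  U(q) = -p / (1 - p)
    """
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability")
    if worse_than_death:
        if p == 1:
            raise ValueError("U(q) is unbounded below as p approaches 1")
        return -p / (1 - p)
    return p


# Hypothetical indifference probabilities.
u_better_sg = sg_utility(0.85)
u_worse_sg = sg_utility(0.5, worse_than_death=True)
```

As with the TTO, the worse-than-death utility is unbounded below: it tends to minus infinity as the indifference probability approaches 1.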

## Which Method Is the Best?

All of these three basic methods for measuring health utilities have certain problems and are subject to some shortcomings and biases. First, the rating scale or VAS scoring method is based on introspection, while TTO and SG methods are based on choices; that is, two alternatives are compared through a sequence of choices converging to an indifference point. The VAS method, although simpler to administer than the other two methods, has weaker theoretical support, since VAS scores are based on judgments and not choices. Moreover, choosing is a more natural human task than scaling, and it is observable and verifiable (Drummond et al., 2016). On the other hand, it is generally recognized that SG has stronger theoretical support than TTO since it is based on the axioms of expected utility theory.

Empirical evidence suggests that health utilities elicited with all of these methods may be biased due to factors which, a priori, should be irrelevant. These factors include the elicitation technique, the framing of the questions, the menu of choices, and the search procedure—that is, the precise routing to find the indifference value. For example, VAS measurements seem to be affected by contextual biases (Bleichrodt & Johannesson, 1997) in the way that scores highly depend on the range of health states to be rated and their relative severity (Robinson, Loomes, & Jones-Lee, 2001; Schwartz, 1998).

The main problems with the TTO scores are loss aversion, scale compatibility, and utility curvature (Bleichrodt, 2002). Loss aversion means that individuals evaluate outcomes as gains or losses with respect to a certain reference point and weigh losses more heavily than gains of the same size. Tversky and Kahneman (1991) explained this psychological phenomenon by a reference-dependent model for riskless choices, as is the case for TTO. Scale compatibility holds that people give more importance to the attribute that is used as the response scale when searching for the indifference point (time, or number of years, in TTO). Finally, utility curvature has to do with the nonlinearity of the utility function of time.

Loss aversion also affects SG utilities, which may be biased by probability weighting as well. These two biases have been described by prospect theory (Tversky & Kahneman, 1992), which assumes that people tend to overweight small probabilities and underweight large ones.

## Searching for the Indifference Values in Choice-Based Methods

In the case of methods based on choices (i.e., TTO and SG), an important issue is the way in which the indifference value is searched for. This is known as the “search procedure.” Although it would be possible to simply ask the subjects to state the value of p (in the SG) or x (in the TTO) that secures the indifference between the two alternatives under evaluation (what is called a “pure matching” procedure), both methods are usually administered through a series of choices (i.e., “choice-based matching”). This sequence of choices often follows an iterative path; that is, each choice of the sequence depends on the subject’s response to the previous choice. For instance, in a TTO, if an individual prefers to live 5 years in full health over 10 years in the health state of interest, the next choice will be between a number of years lower than 5 in full health versus 10 years in the impaired health state. Iterative search procedures may follow different routings such as bisection, titration (or up and down), ping-pong, and so on (Attema, Edelaar-Peeters, Versteegh, & Stolk, 2013). These procedures are usually “transparent” in the sense that the individual may easily notice that he or she is being led to an indifference value or interval. Less transparent methods have been suggested as a way of avoiding biases (e.g., anchoring effects) and reducing inconsistencies or discrepancies between preferences revealed through direct choices and those derived from choice-based matching procedures. These nontransparent methods try to hide the converging nature of the sequence of choices. One of them is the parameter estimation by sequential testing (PEST) procedure, originally described by Taylor and Creelman (1967) in the field of psychophysics, which has been used in economics (Bostic, Herrnstein, & Luce, 1990) and in the domain of health state valuation (Abellán-Perpiñán, Sánchez-Martínez, Martínez-Pérez, & Méndez-Martínez, 2012; Bleichrodt, Doctor, & Stolk, 2005).
Another nontransparent method is the hidden choice-based matching procedure, proposed by Fischer, Carmon, Ariely, and Zauberman (1999), which has been shown to reduce the rate of preference reversals in the valuation of health states (Pinto-Prades, Sánchez-Martínez, Abellán-Perpiñán, & Martínez-Pérez, 2018).
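To make the idea of an iterative, transparent search concrete, the following Python sketch implements a plain bisection routing for a better-than-death TTO. It is a simplified illustration and not the protocol of any of the studies cited: the respondent is simulated by a hypothetical deterministic function, and the names `bisection_tto` and `simulated_respondent` are assumptions.

```python
from typing import Callable


def bisection_tto(prefers_full_health: Callable[[float], bool],
                  t_years: float, n_steps: int = 10) -> float:
    """Narrow down the indifference duration x by repeatedly halving an interval.

    prefers_full_health(x) is True when the respondent prefers x years in full
    health to t_years in the impaired state.
    """
    low, high = 0.0, t_years
    for _ in range(n_steps):
        offer = (low + high) / 2
        if prefers_full_health(offer):
            high = offer   # next offer will be fewer years in full health
        else:
            low = offer    # next offer will be more years in full health
    return (low + high) / 2


def simulated_respondent(x: float) -> bool:
    """Hypothetical respondent whose true indifference point is 6.5 years."""
    return x > 6.5


x_hat = bisection_tto(simulated_respondent, t_years=10)
utility_hat = x_hat / 10
```

After ten choices the remaining interval is narrower than 0.01 years, so `x_hat` lies very close to the simulated indifference point. Real respondents are not deterministic, and the first offer can act as an anchor, which is one motivation for the nontransparent procedures mentioned above.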

# Problems With Preference Elicitation

Until now, this article has explained basic issues around the QALY model. In order to use the model in health policy, the parameters of the model have to be estimated and those parameters have to reflect society’s preferences and values. To keep things simple, if one wants to apply the linear QALY model $U(q,t)=U(q)×t$, one must estimate $U(q)$. There has been a lot of debate about “whose utilities,” but the usual approach is to use an average $U(q)$ estimated from a representative sample of the general population. Researchers apply the methods explained earlier to elicit preferences of a representative sample of the general population and aggregate those responses in order to estimate the number of QALYs produced by different medical treatments. However, responding to VAS, TTO, or SG questions consistently is not easy. Responses must pass some rationality tests in order to be used by decision-makers to guide social policy. Unfortunately, this is not always the case. In the next section, the article shows some inconsistencies observed in the literature that has tried to elicit preferences from members of the general population. This is a very important issue. It would not be very helpful to have a model that cannot be estimated in practice.

## Order Effects

Most surveys conducted to calculate health state utilities ask subjects to evaluate several health states in the same session. For example, in EuroQol questionnaires (Dolan, Gudex, Kind, & Williams, 1995, 1996), subjects are asked to evaluate 13 health states using the TTO technique. The utility of a health state should be the same, regardless of its position within the group, since this position is arbitrary. However, several studies have observed that this is not the case. Augestad, Rand-Hendriksen, Kristiansen, and Stavem (2012) used data from 3,773 respondents of the U.S. EQ-5D valuation study, each of whom valued 12 health states (plus unconscious) in random order. The researchers found that utilities for mild health states were higher at the end of the sequence, while utilities for bad health states were lower at the end of the sequence. That produced more extreme values at the end of the sequence than at its beginning. Finnell, Carroll, and Downs (2012) tested whether the order in which SG and TTO utilities were obtained affected the relative (SG vs. TTO) values of the utilities obtained by each technique. Utilities were assessed for 29 health states from 4,016 subjects by using SG and TTO. The assessment order was randomized by respondent. For analysis by health state, the authors calculated the SG-TTO difference for each assessment and tested whether it differed significantly between the two groups (SG first and TTO first). They found that in 19 of 29 health states, the SG-TTO difference was significantly greater when TTO was assessed first. Finally, Llewellyn-Thomas et al. (1982) elicited preferences from patients using two different techniques (category rating and SG). Some patients valued the health states starting with category rating and others starting with the SG technique. The authors found that utilities were influenced by the order of the elicitation method.
More specifically, they found that utilities were higher when the SG technique was used first.

## Starting Point Biases

There is a vast literature regarding anchoring effects in preference elicitation. One type of anchoring effect is the so-called starting point bias. Surprisingly, there is very little evidence regarding this potential bias in the valuation of health states. Augestad, Stavem, Kristiansen, Samuelsen and Rand-Hendriksen (2016) randomized a total of 1,249 respondents who valued eight EQ-5D health states in a web study using the TTO technique. All subjects had to compare 10 years in full health with a certain number of years in bad health and establish the indifference point. Respondents were randomized to 11 different starting points. The authors found that the values elicited using TTO were substantially influenced by the starting point of the task, supporting the anchoring hypothesis. The observed effects were substantial, with an estimated mean shift in values of 0.19 from the lowest to the highest starting point in one of the groups.

## Preference Reversals

The preference reversal is probably one of the most replicated violations of standard choice theories. A preference reversal is produced when two (apparently) equivalent methods of eliciting preferences lead to a contradiction in those preferences. According to one method the subject prefers A but, according to another method, the subject prefers B. Originally, the phenomenon was discovered in monetary lotteries. In stylized form, the phenomenon can be explained as follows: Subjects are presented with two lotteries, L1 and L2. L1 offers a high probability of a small amount of money. It is called the P-bet. L2 offers a low probability of a larger amount of money. It is called the \$-bet. Subjects must evaluate the lotteries using two methods. One is a valuation method; namely, they have to state the certain amount of money that is equivalent to each lottery. This is called the certainty or the monetary equivalent of the lottery. They also have to choose between L1 and L2. Usually, the monetary equivalent for the \$-bet is higher than the monetary equivalent for the P-bet. This implies that the \$-bet has a higher value. However, when subjects choose between the two lotteries, they prefer the P-bet. This contradicts the results of the monetary equivalent method. The problem is that there is no way to know if people prefer one lottery or the other, since both methods (monetary equivalents and choices) should produce the same ranking between the lotteries. People pay more for those objects with higher value, and they choose the object with higher value. A preference reversal contradicts this elementary consistency rule. In the elicitation of preferences for health states, a preference reversal is any evidence that different methods produce different ordinal rankings for health states. The phenomenon is not related exclusively to situations where there is risk as in the case of monetary lotteries.

Stalmeier, Wakker, and Bezembinder (1997) conducted some experiments with a similar framing. Subjects were presented with several health states (e.g., living with migraine x days per week). Subjects were asked (in one case) three TTO questions (e.g., living x years with migraine 4.5 days per week versus y years in full health). They used three values for x: 5, 10, and 20 years. The pattern was very clear: subjects produce a higher value for y for higher values of x. However, in a direct choice, they preferred the profile with the shorter value of x. Subjects were not willing to modify their responses even when they acknowledged that there was a contradiction between the two responses.

## Direct Versus Chained Utilities

Another worrying result for health economics is that the so-called chained utilities do not match utilities elicited with the methods previously explained. This is a concern since, according to the traditional assumptions used in the QALY model (see earlier discussion), both approaches should produce the same utilities.

The utility of a health state obtained from a TTO or SG question where the end points are full health and death is considered a “classic” utility. “Chained” utilities are those obtained from a TTO or SG question where at least one of the outcomes is neither full health nor death. In this case, the response to a chained TTO/SG has to be linked to utilities obtained with direct TTO/SG in order to calculate the utility of the health state that is obtained through chaining. For example, assume that the utility of health state A is to be measured using chaining. First, a TTO question such as (A, x years; death) versus (B, z years; death) should be asked. Then $U(B)$ has to be estimated using the classic TTO. More specifically, $U(A)_{\text{CHAINED}}$ would be obtained as $U(B)_{\text{CLASSIC}}\times(z/x)$. As mentioned, under the assumptions of the QALY model presented at the beginning of this article, it should be true that $U(A)_{\text{CHAINED}}=U(A)_{\text{DIRECT}}$. In a recent paper, Taylor, Chilton, Ronaldson, Metcalf, and Nielsen (2017) observed that health gains obtained as a difference between direct utilities of two health states do not correspond with the gains elicited by asking subjects to compare those health states directly. If health gains are different using two seemingly equivalent methods, it is not possible to know which is the real health gain. Some evidence comparing direct and chained utilities is discussed next.
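The chaining step for the TTO can be illustrated with a few lines of Python. The function name and the numbers are hypothetical; the formula is the one stated in the text, which follows from $U(A)\,x=U(B)\,z$ under the linear QALY model.

```python
def chained_tto_utility(u_b_classic: float, x_years_in_a: float, z_years_in_b: float) -> float:
    """U(A) chained through B: (A, x years) ~ (B, z years) implies U(A) = U(B) * z / x."""
    if x_years_in_a <= 0 or z_years_in_b < 0:
        raise ValueError("need x > 0 and z >= 0")
    return u_b_classic * z_years_in_b / x_years_in_a


# Hypothetical responses: U(B) = 0.8 from a classic TTO, and the respondent is
# indifferent between 10 years in A and 7.5 years in B.
u_a_chained = chained_tto_utility(0.8, x_years_in_a=10, z_years_in_b=7.5)
```

With these illustrative inputs the chained utility of A is 0.6. If the QALY model held, a direct TTO for A should return the same figure.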

Probably the first study to analyze this issue was Llewellyn-Thomas et al. (1982). The authors compared direct and chained utilities with the SG method using two types of chaining. To explain those two types, assume that there are three health states, say A, B, and C, and subjects rank those three health states as follows: A better than B and B better than C. Llewellyn-Thomas et al. estimated $U(A)$, $U(B)$, and $U(C)$ using the direct SG (with full health and death as outcomes of the gamble). Chained method 1 asked subjects to choose between a health state as chronic and a gamble where the best health state was not full health and the worst was immediate death. For example, they had to estimate $p$ such that the subject was indifferent between health state B as chronic and a lottery $(p, A; (1-p), \text{death})$. Then $U(B)_{\text{chained}}=p\,U(A)$. Chained method 2 asked subjects to choose between a health state as chronic and a gamble where the best health state was full health and the worst was not immediate death. For example, they had to estimate $p$ such that the subject was indifferent between health state B as chronic and a lottery $(p, \text{full health}; (1-p), C)$. Then $U(B)_{\text{chained}}=p\,U(\text{full health})+(1-p)\,U(C)$. Their result was that chained method 2 produced higher utilities than the direct method while chained method 1 produced much lower utilities.
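The two chaining rules can be written out as a short Python sketch. The function names and the input values are hypothetical and are not data from Llewellyn-Thomas et al. (1982); the sketch only restates the two formulas given above, with $U(\text{full health})=1$ and $U(\text{death})=0$.

```python
def chained_sg_method1(p: float, u_best: float) -> float:
    """B ~ (p, A; 1 - p, death): U(B) = p * U(A), since U(death) = 0."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability")
    return p * u_best


def chained_sg_method2(p: float, u_worst: float) -> float:
    """B ~ (p, full health; 1 - p, C): U(B) = p * 1 + (1 - p) * U(C)."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability")
    return p * 1.0 + (1 - p) * u_worst


# Hypothetical direct utilities U(A) = 0.9 and U(C) = 0.4, with made-up responses.
u_b_method1 = chained_sg_method1(p=0.8, u_best=0.9)
u_b_method2 = chained_sg_method2(p=0.5, u_worst=0.4)
```

Under expected utility both routes, and the direct SG, should give the same $U(B)$; in this made-up example they give 0.72 and 0.70, the kind of gap that the study reports in a systematic direction.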

Other studies have observed this disparity. Sutherland, Dunn, and Boyd (1983) replicated the result of Llewellyn-Thomas et al. (1982) using VAS instead of SG. Stalmeier (2002) also obtained similar results using TTO.

The study by Morrison, Neilson, and Malek (2002) illustrates very well the implications of this problem for the measurement of the benefits of medical treatments. The objective of this study was to assess the benefits of surgical or orthotic interventions on patients with complex physical disabilities and the indirect effect of these interventions on the patients’ carers (usually a parent). The authors estimated the benefit of the treatment for the patient and the carer by means of the TTO technique. Patients and carers were asked a direct TTO question before and after the medical intervention. They were also asked a chained TTO question after the intervention. Researchers could not ask the chained TTO question before the intervention because the question involved a choice between more years in the better (postintervention) health state and the worse (preintervention) health state. The mean utility for the health state before the intervention was 0.733 for patients and 0.938 for carers. The mean utility for the health state after the intervention was 0.804 for patients and 0.943 for carers. This implies a benefit of 0.07 for patients and virtually no benefit (0.005) for carers. However, when subjects were asked the chained TTO, which involved a comparison between pre- and postintervention health states, the utility of the health state before the intervention was 0.635 for patients and 0.875 for carers. This implies that the chained procedure revealed much bigger gains for patients and carers. Direct and chained procedures should produce the same results, but they do not. In Morrison et al.’s study the benefit of the medical interventions crucially depends on the method used to measure those benefits. This is a very important problem if the QALY model is going to be used to guide social policy.
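The size of the discrepancy can be checked with simple arithmetic on the mean utilities quoted above. The sketch below rests on one assumption that is not stated in the text: the chained gain is computed here as the direct postintervention utility minus the chained preintervention utility. The variable names are illustrative.

```python
# Mean utilities as quoted in the text.
direct_pre = {"patients": 0.733, "carers": 0.938}
direct_post = {"patients": 0.804, "carers": 0.943}
chained_pre = {"patients": 0.635, "carers": 0.875}

# Gain implied by the two direct TTO questions.
direct_gain = {g: round(direct_post[g] - direct_pre[g], 3) for g in direct_pre}

# Assumption of this sketch: chained gain = direct post minus chained pre.
chained_gain = {g: round(direct_post[g] - chained_pre[g], 3) for g in chained_pre}
```

Under that assumption the gain for patients rises from about 0.07 to about 0.17, and the gain for carers from about 0.005 to about 0.07, so the estimated benefit more than doubles for patients when the chained valuation is used.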

## Within-Methods Inconsistencies

There is plenty of evidence of disparities between methods. The traditional disparity observed has been between valuation (matching) and choice. Here the focus is on disparities observed within the same method. Inconsistencies within SG, TTO, and VAS are reviewed.

One example of the disparities within the SG is Bleichrodt, Abellán-Perpiñán, Pinto-Prades, and Méndez-Martínez (2007). To understand their method, the reader has to remember that in the SG, subjects have to compare a health profile $(q,t)$, where $q$ is quality of life and $t$ is duration, with a (usually) binary gamble. The two outcomes are usually full health $(FH,t')$ and death. That is to say, they have to compare $(q,t_0)$ with $[p,(FH,t_1);(1-p),\text{death}]$ and estimate indifference between the two options. In principle, the parameter used to estimate indifference should be irrelevant. The traditional method used to achieve indifference was to elicit the probability $p$ while $t_0=t_1$, called the probability equivalent. Bleichrodt et al. compared this method with two other (apparently equivalent) methods. One method (certainty equivalent) presented subjects with a lottery with all parameters fixed ($p$ and $t_1$), and subjects had to reach indifference between $(q,t_0)$ and $[p,(FH,t_1);(1-p),\text{death}]$ by varying $t_0$. In the third method (value equivalent), $p$ and $t_0$ were fixed and subjects had to reach indifference between $(q,t_0)$ and $[p,(FH,t_1);(1-p),\text{death}]$ by varying $t_1$. Bleichrodt et al. found a large discrepancy in the utility of $q$ calculated with those three methods. They estimated $U(q)$ for two health states. In the case of the mildest health state, $U(q)$ ranged from 0.8 (for the value equivalent) to 0.5 (for the certainty equivalent). The probability equivalent method produced utilities between those of the certainty and value equivalent methods.
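Why the three methods should agree can be illustrated with a minimal sketch. It assumes expected utility together with the linear QALY model, with the utility of full health set to 1 and that of death set to 0, so that indifference implies that $U(q)$ times the certain duration equals $p$ times the duration in full health. The function name and all responses below are hypothetical and are not data from Bleichrodt et al. (2007).

```python
def sg_utility(p, t0, t1):
    """U(q) implied by indifference between (q, t0) and [p, (FH, t1); 1-p, death].

    Assumes expected utility and the linear QALY model with U(FH) = 1 and
    U(death) = 0, so that U(q) * t0 = p * t1.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if t0 <= 0 or t1 <= 0:
        raise ValueError("durations must be positive")
    return p * t1 / t0


# Hypothetical responses from a respondent whose utility for q is 0.8.
probability_equivalent = sg_utility(p=0.8, t0=10, t1=10)   # respondent states p
certainty_equivalent = sg_utility(p=0.5, t0=12.5, t1=20)   # respondent states t0
value_equivalent = sg_utility(p=0.5, t0=10, t1=16)         # respondent states t1

print(probability_equivalent, certainty_equivalent, value_equivalent)
```

If the model held, the choice of response parameter would not matter and all three calls would recover the same utility. The discrepancies reported above are departures from this prediction.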

In the case of the TTO procedure, one example of an internal disparity is the study by Bleichrodt et al. (2003). The authors asked subjects a traditional TTO question; namely, they had to estimate the indifference point between the profiles $(q,t)$ and $(FH,x)$, where $q$ is worse than $FH$ and $t > x$. The standard TTO fixes $t$ and asks subjects to provide the value of $x$ that makes the two options indifferent. Here $t_s$ is the value of $t$ used as stimulus and $x^*$ the response to the standard TTO question. Bleichrodt et al. used $x^*$ in a second TTO question. In this case, the subject had to provide the value of $t$ such that $(q,t)$ and $(FH,x^*)$ were indifferent: $t^*$ is the response provided by the subjects. Obviously, one should expect $t^*=t_s$. However, they obtained $t^*>t_s$, producing significantly lower utilities. See also Bleichrodt and Pinto-Prades (2002) and Attema and Brouwer (2008, 2012) for similar types of study. Attema and Brouwer (2013) also observed such a discrepancy within TTO tasks, but it was much lower for indifferences elicited by means of ping-pong choices than for indifferences elicited by direct matching.
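The effect of this inconsistency on the resulting utility can be sketched as follows, assuming the linear QALY model without discounting, under which indifference between a period in health state q and a shorter period in full health implies that the utility of q is the ratio of the two durations. The durations used here are hypothetical and serve only to show the direction of the effect.

```python
def tto_utility(x, t):
    """U(q) implied by indifference between (q, t) and (FH, x).

    Assumes the linear QALY model with U(FH) = 1 and no discounting.
    """
    if t <= 0 or x < 0 or x > t:
        raise ValueError("expected 0 <= x <= t and t > 0")
    return x / t


# Hypothetical example: the stimulus is 20 years in q and the respondent
# reports indifference with 15 years in full health.
t_s, x_star = 20.0, 15.0
u_first = tto_utility(x_star, t_s)

# Second question: x_star is now fixed and the respondent states a duration.
# Consistency requires t_star == t_s; here the response is longer.
t_star = 25.0
u_second = tto_utility(x_star, t_star)

print(u_first, u_second)
```

Because the duration in full health is held fixed, any response in the second question that exceeds the original stimulus necessarily lowers the implied utility.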

Finally, one example that showed internal inconsistencies within VAS is Robinson et al. (2001). The objective of the study was to evaluate several health states using a VAS. Subjects were split into two subgroups. In each group, subjects had to evaluate nine health states. Three health states were common to both groups. The rest of the health states were different. In one group, the health states that were different were mild, while in the other group they were severe. Robinson et al. found that the values of the common health states were different in the two subsamples. In the subgroup where the accompanying health states were severe, the three common health states received a higher value with VAS than in the group where the rest of the health states were mild. Clearly, the value of health states was influenced by the context provided by the rest of the states.

There are many more examples of problematic results. It is not the purpose of this article to provide a full list. The objective of the second part of this article has been to show the kind of problems that can be observed when trying to elicit preferences for health states. This is a very important problem related to using QALYs in practice. How can one base social policy on values that can radically change depending on the elicitation methods used, the framing of the questions, the order of the questions, and so on? The final section of this article (“Dealing With Problems”) shows how the literature has approached this problem.

# Using Utilities in Practice

The problems and limitations associated with the use of QALYs have been mentioned elsewhere (Lipscomb, Drummond, Fryback, Gold, & Revicki, 2009; Neumann et al., 2017, ch. 7). Neumann et al. and Lipscomb et al. point out that different elicitation methods (SG, TTO, VAS) and different multiattribute health utility profiles (e.g., Health Utilities Index, EuroQol) can produce different results. This has not prevented some national bodies from using utilities elicited in a certain way in the calculation of cost-effectiveness ratios. For example, the Canadian Agency for Drugs and Technologies in Health (2017) states that “researchers should use health preferences obtained from an indirect method of measurement that is based on a generic classification system (e.g., EuroQol 5-Dimensions questionnaire [EQ-5D], Health Utilities Index [HUI], Short Form 6-Dimensions [SF-6D]).” One problem is that, as Hanmer et al. (2016) have shown, health benefits change depending on the method used to estimate utilities. Hanmer et al. compared the impact of 16 prevalent chronic conditions using six utility-based indexes of health and VAS. The multiattribute utility profiles were EuroQol-5D-3L, Health and Activities Limitation Index, Health Utilities Index Mark 2 and Mark 3, preference-based scoring for the SF-36, and the Quality of Well-Being Scale. They found that there were significant differences between the indexes for estimates of the absolute impact of most conditions. Some institutions recommend a single instrument. For example, the Dutch agency (Zorginstituut Nederland) and the National Institute for Health and Care Excellence recommend the use of the EuroQol. In summary, most of the agencies that use cost-effectiveness analysis seem to take a pragmatic perspective and use utilities obtained from interviewing members of the general public, using TTO or SG and some sort of multiattribute profile. The argument seems to be that, even accepting the limitations of those numbers, it is the best one can do to incorporate public preferences in healthcare decision-making.

However, this article presents evidence that should be a source of concern for the agencies that use those utilities. National agencies that use health state values should be worried if utilities depend on arbitrary decisions such as the order of evaluation of health states or if direct and chained utilities do not match. Lipscomb et al. (2009) argue that those problems are a challenge to researchers and practitioners, but that they imply that those methods should be improved rather than abandoned. Concurring with that observation, this article ends by suggesting some ways to improve measurement techniques.

# Dealing With Problems

## Two Different Approaches

The standard way of dealing with the problems explained earlier is “debiasing” (Fischhoff, 1982). For example, Fischhoff (2010) suggests providing personalized feedback to subjects or training them extensively as debiasing mechanisms. Using some of those indications may indeed reduce biases and help avoid some of the problems mentioned. However, there is evidence that this is not enough, because some of the effects mentioned are the consequence of processes that happen subconsciously, and it is almost impossible for subjects to avoid them. Take, for example, the case of preference reversals. When researchers observed the phenomenon, they explained and showed subjects their inconsistency. However, very few people wanted to change their responses. What to do then?

The first option is to assume that subjects have “true” preferences that are consistent. However, those true preferences are not very well formed for questions that are not a part of everyday decisions, like hypothetical questions about health problems. Given that those preferences are not well defined, individuals’ responses are influenced by irrelevant elements, and this may produce response patterns that are inconsistent. Subjects may not be able to reconcile those inconsistencies, since the biases that generate those problems are probably better understood by the researcher than by the interviewee. Under those circumstances, some researchers advocate what some authors define as “preference purification.” Infante, Lecouteux, and Sugden (2016) define preference purification as reconstructing “individuals’ underlying or latent preferences by simulating what they would have chosen, had they not been subject to reasoning imperfections” (p. 6). This is the most popular method in health economics.

There is a less popular but no less interesting approach, namely, dealing with the main problem that generates most of those inconsistencies. This problem is called “preference imprecision.” Researchers in health economics have paid very little attention to the issue of imprecision. However, there are some papers (Pinto-Prades et al., 2018) showing that subjects with better-defined preferences respond more consistently; for example, they produce fewer preference reversals.

## Literature Overview: Debiasing

Thus far, the most popular method to debias health valuations is to apply a prospect theory correction of estimates that were obtained under the assumption of expected utility, using specific parametric shapes of the utility and probability weighting functions and a common set of estimates. Bleichrodt, van Rijn, and Johannesson (1999) show that incorporating probability weighting in the SG method improves the consistency of QALY-based decision-making with individual preferences. Elaborating on this approach, Bleichrodt, Pinto-Prades, and Wakker (2001) derived formulas to compute utilities for life duration that have been obtained by means of certainty equivalences or probability equivalences, under the erroneous assumption of expected utility. However, this approach can just as well be applied to correct health state valuation methods, including SG and TTO. The approach takes as a primitive that expected utility theory is the normative theory for decision-making under uncertainty, which is a widely held view (Broome, 1991; Edwards, 2013; Savage, 1954).

Bleichrodt et al. (2001) found that the initial discrepancy between utilities elicited with the certainty equivalent and those elicited with the probability equivalent disappeared after correction. Consequently, Bleichrodt et al. proposed using such a correction formula whenever interactive sessions to correct for biases are not possible. Further arguments for adopting a correction approach à la Bleichrodt et al. can be found in Pinto-Prades and Abellán-Perpiñán (2012).
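The logic of such a correction can be illustrated with a deliberately simplified sketch that considers probability weighting only. If the gamble between full health and death is evaluated with a weighted probability rather than the stated one, the utility implied by a probability equivalent is the weight of the stated probability rather than the probability itself. The one-parameter inverse-S weighting function and the parameter value 0.61 used below are assumptions commonly attributed to Tversky and Kahneman (1992); they are not the formulas of Bleichrodt et al. (2001), which also account for loss aversion.

```python
def tk_weight(p, gamma=0.61):
    """One-parameter inverse-S probability weighting function.

    The functional form and the default gamma are illustrative assumptions,
    not estimates taken from the studies discussed in the text.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    numerator = p ** gamma
    denominator = (p ** gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)
    return numerator / denominator


def corrected_pe_utility(p, gamma=0.61):
    """Probability-equivalent utility corrected for probability weighting only.

    With U(FH) = 1 and U(death) = 0, the gamble is worth w(p), so the
    corrected utility of q is w(p) instead of the stated probability p.
    """
    return tk_weight(p, gamma)


stated_p = 0.9                      # hypothetical probability equivalent
uncorrected = stated_p              # utility under expected utility
corrected = corrected_pe_utility(stated_p)

print(uncorrected, round(corrected, 3))
```

With these illustrative parameters, high stated probabilities are underweighted, so the corrected utility falls below the uncorrected one. This is the direction of adjustment that brings probability equivalent utilities closer to those from riskless methods.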

Bleichrodt et al. (2007) built upon the approach of Bleichrodt et al. (2001) and found that prospect theory could explain the systematic discrepancies found between riskless-risk methods and that it performed better than other nonexpected utility models. Abellán-Perpiñán et al. (2009) reported that prospect theory leads to better health evaluations than expected utility for decisions under risk but not for intertemporal decisions, where TTO did better than prospect theory–corrected SG. Recently, Lipman, Brouwer, and Attema (2017) explored the correction of SG and TTO estimates for prospect theory parameters estimated at the individual level. They observed that the initial significant difference between SG and TTO vanished after correction for individualized prospect theory parameters.

An alternative is to use methods that are less susceptible to the aforementioned biases, for example, by carefully constructing the stimuli or using procedures not affected by these biases (Bleichrodt et al., 2001; Payne, Bettman, & Schkade, 1999; Wakker & Deneffe, 1996). For example, Attema, Bleichrodt, and Wakker (2012) proposed the direct method to elicit utility of life duration, which is risk free and therefore not distorted by probability weighting. They showed that utility obtained with the certainty equivalent was more concave than utilities elicited with the direct method, but this difference was no longer significant after correction for probability weighting.

## Critiques on Debiasing

Apart from the question of how to correct for biases, one could ask the more fundamental question of whether or not utilities should be corrected at all. For instance, Infante et al. (2016) criticized the approach of Bleichrodt et al. (2001), arguing that it is paternalistic to correct utilities and that one cannot simply assume that violations of rationality as captured by expected utility theory are behaviors that will necessarily make decision-makers worse off (see also Hausman, 2012). In particular, they argue that the idea that context-dependent choices are the result of errors of reasoning is misconceived.

The critique of Infante et al. (2016) also relates to the legitimacy of the concept of nudging healthy lifestyles, which, according to Sugden (2017), does not necessarily improve individuals’ welfare. Their main critique of the proposed methods to reconcile descriptive findings with normative economics seems to concern the implied assumptions of expected utility and that the agent has preferences that are consistent with each other but that are not always reflected by measurement instruments (Infante et al., 2016).

The debiasing mechanisms described here are only valid insofar as prospect theory holds, which is sometimes debated (Birnbaum, 2008; Loomes, 2010). Moreover, prospect theory does not incorporate all biases that have been demonstrated in preference elicitation tasks. One important bias not captured by prospect theory is scale compatibility, which holds that respondents give relatively more weight to the outcome that is used as the response scale than to any other outcomes involved (Tversky, Sattath, & Slovic, 1988). There are no alternative theories that account for scale compatibility, and hence no corrections for its influence have been proposed yet. Furthermore, Bleichrodt and Pinto-Prades (2009) found a more fundamental preference reversal, which none of the existing theories was able to explain. Hence, correction mechanisms are no panacea.

Yet another stance that can be taken is to adopt a deliberative approach, where elicitation methods are first debiased and then it is left to the policymakers to put any bias back in, depending on the particular situation and the aims of the policymaker (Lipman, Brouwer, & Attema, 2019). For example, if a policymaker argues that probability weighting is a bias that should be corrected for at all cost, whereas loss aversion is deemed to be a behavior that will not cause individuals to be less well off, then he or she could decide to reinsert a loss aversion parameter for allocation problems that involve losses.

## Imprecise Preferences

Another issue is that preference reversals and observed rationality violations may at least partly be due to imprecise preferences. Butler and Loomes (2007) illustrate that preference reversals for choice and matching tasks can be explained by preference imprecision. Hence, it may not always be necessary to take into account prospect theory preferences. In addition, Butler and Loomes (2011) argue that preference imprecision can explain the violations of the axioms of independence and betweenness, which are essential to expected utility theory.

In a study comparing choice and matching implementation of the SG, Pinto-Prades et al. (2018) observed that more precise respondents were less prone to preference reversals than respondents who made more errors. They also found that the number of preference reversals is reduced when using nontransparent methods in the matching task. A similar finding was reported in a study on willingness to pay for health, where Pinto-Prades, Sánchez-Martínez, Abellán-Perpiñán, and Martínez-Pérez (2017) show how preference imprecision affects valuations. They recommend using joint evaluations instead of separate evaluations in contingent valuation studies. Hence, it is important to incorporate an appropriate error structure in any model that is used to describe preferences (Loomes, Pinto-Prades, Abellán-Perpiñán, & Rodríguez-Míguez, 2010).

It has been explained how health preferences elicited by traditional methods can be debiased to reduce some of the systematic errors commonly made by individuals. The empirical literature showing that these debiasing mechanisms indeed reduce inconsistencies between different valuation methods has been reviewed. However, it has also been stressed that many inconsistencies may be caused by imprecise preferences instead of systematic biases, and, hence, it is important to incorporate an adequate error theory as well. Finally, some other problems of debiasing have been discussed, and the evidence that not all scholars agree that it is desirable to debias preferences at all has also been highlighted.

## Further Reading

Abellán-Perpiñán, J. M., Herrero, C., & Pinto-Prades, J. L. (2016). QALY-based cost-effectiveness analysis. In M. D. Adler & M. Fleurbaey (Eds.), The Oxford handbook of well-being and public policy (pp. 143–165). Oxford, U.K.: Oxford University Press.

Adler, M. D. (2006). QALYs and policy evaluation: A new perspective. Yale Journal of Health Policy, Law, and Ethics, 6, 1–92.

Attema, A. E., & Brouwer, W. B. F. (2013). In search of a preferred preference elicitation method: A test of the internal consistency of choice and matching tasks. Journal of Economic Psychology, 39, 126–140.

Attema, A. E., Edelaar-Peeters, Y., Versteegh, M. M., & Stolk, E. A. (2013). Time trade-off: One methodology, different methods. The European Journal of Health Economics, 14(Suppl. 1), S53–S64.

Bala, M. V., & Zarkin, G. A. (2000). Are QALYs an appropriate measure for valuing morbidity in acute diseases? Health Economics, 9(2), 177–180.

Bleichrodt, H., Pinto-Prades, J. L., & Wakker, P. P. (2001). Making descriptive use of prospect theory to improve the prescriptive use of expected utility. Management Science, 47(11), 1498–1514.

Bleichrodt, H., Wakker, P., & Johannesson, M. (1997). Characterizing QALYs by risk neutrality. Journal of Risk and Uncertainty, 15(2), 107–114.

Donaldson, C., Baker, R., Mason, H., Jones-Lee, M., Lancsar, E., Wildman, J., . . . Smith, R. (2011). The social value of a QALY: Raising the bar or barring the raise? BMC Health Services Research, 11, 8.

Johannesson, M., Pliskin, J. S., & Weinstein, M. C. (1994). A note on QALYs, time tradeoff, and discounting. Medical Decision Making, 14(2), 188–193.

Johnson, F. R. (2009). Editorial: Moving the QALY forward or just stuck in traffic? Value in Health, 12, 38–39.

Torrance, G. W. (1986). Measurement of health state utilities for economic appraisal: A review. Journal of Health Economics, 5(1), 1–30.

Torrance, G. W., Furlong, W., Feeny, D., & Boyle, M. (1995). Multi-attribute preference functions. Health Utilities Index. Pharmacoeconomics, 7(6), 503–520.

Weinstein, M. C., Torrance, G., & McGuire, A. (2009). QALYs: The basics. Value in Health, 12(Suppl. 1), S5–S9.

## References

Abellán-Perpiñán, J. M., Bleichrodt, H., & Pinto-Prades, J. L. (2009). The predictive validity of prospect theory versus expected utility in health utility measurement. Journal of Health Economics, 28(6), 1039–1047.

Abellán-Perpiñán, J. M., Herrero, C., & Pinto-Prades, J. L. (2016). QALY-based cost effectiveness analysis. In M. Adler & M. Fleurbaey (Eds.), The Oxford handbook of well-being and public policy (pp. 143–165). New York, NY: Oxford University Press.

Abellán-Perpiñán, J. M., & Pinto-Prades, J. L. (2000). Quality adjusted life years as expected utilities. Spanish Economic Review, 2, 49–63.

Abellán-Perpiñán, J. M., Pinto-Prades, J. L., Méndez, I., & Badía, X. (2006). Towards a better QALY model. Health Economics, 15(7), 665–676.

Abellán-Perpiñán, J. M., Sánchez-Martínez, F. I., Martínez-Pérez, J. E., & Méndez-Martínez, I. (2012). Lowering the floor of the SF-6D scoring algorithm using a lottery equivalent method. Health Economics, 21(11), 1271–1285.

Attema, A. E., Bleichrodt, H., & Wakker, P. P. (2012). A direct method for measuring discounting and QALYs more easily and reliably. Medical Decision Making, 32(4), 583–593.

Attema, A. E., & Brouwer, W. B. F. (2008). Can we fix it? Yes we can! But what? A new test of procedural invariance in TTO‐measurement. Health Economics, 17(7), 877–885.

Attema, A. E., & Brouwer, W. B. F. (2010). On the (not so) constant proportional trade-off in TTO. Quality of Life Research, 19(4), 489–497.

Attema, A. E., & Brouwer, W. B. F. (2012). The way that you do it? An elaborate test of procedural invariance of TTO, using a choice-based design. The European Journal of Health Economics, 13(4), 491–500.

Attema, A. E., & Brouwer, W. B. F. (2013). In search of a preferred preference elicitation method: A test of the internal consistency of choice and matching tasks. Journal of Economic Psychology, 39, 126–140.

Attema, A. E., Edelaar-Peeters, Y., Versteegh, M. M., & Stolk, E. A. (2013). Time trade-off: One methodology, different methods. The European Journal of Health Economics, 14(Suppl. 1), S53–S64.

Augestad, L. A., Rand-Hendriksen, K., Kristiansen, I. S., & Stavem, K. (2012). Learning effects in time trade-off based valuation of EQ-5D health states. Value in Health, 15(2), 340–345.

Augestad, L. A., Stavem, K., Kristiansen, I. S., Samuelsen, C. H., & Rand-Hendriksen, K. (2016). Influenced from the start: Anchoring bias in time trade-off valuations. Quality of Life Research, 25(9), 2179–2191.

Beresniak, A., & Dupont, D. (2016). Is there an alternative to quality-adjusted life years for supporting healthcare decision making? Expert Review of Pharmacoeconomics & Outcomes Research, 16(3), 351–357.

Birnbaum, M. H. (2008). New paradoxes of risky decision making. Psychological Review, 115(2), 463–501.

Bleichrodt, H. (2002). A new explanation for the difference between time trade-off utilities and standard gamble utilities. Health Economics, 11(5), 447–456.

Bleichrodt, H., Abellán-Perpiñán, J. M., Pinto-Prades, J. L., & Méndez-Martínez, I. (2007). Resolving inconsistencies in utility measurement under risk: Tests of generalizations of expected utility. Management Science, 53(3), 469–482.

Bleichrodt, H., Doctor, J., & Stolk, E. (2005). A nonparametric elicitation of the equity-efficiency trade-off in cost-utility analysis. Journal of Health Economics, 24(4), 655–678.

Bleichrodt, H., & Gafni, A. (1996). Time preference, the discounted utility model and health. Journal of Health Economics, 15(1), 49–66.

Bleichrodt, H., & Johannesson, M. (1997). An experimental test of a theoretical foundation for rating-scale valuations. Medical Decision Making, 17(2), 208–216.

Bleichrodt, H., & Pinto-Prades, J. L. (2002). Loss aversion and scale compatibility in two-attribute trade-offs. Journal of Mathematical Psychology, 46(3), 315–337.

Bleichrodt, H., & Pinto-Prades, J. L. (2005). The validity of QALYs under non-expected utility. The Economic Journal, 115(503), 533–550.

Bleichrodt, H., & Pinto-Prades, J. L. (2009). New evidence of preference reversals in health utility measurement. Health Economics, 18(6), 713–726.

Bleichrodt, H., Pinto-Prades, J. L., & Abellán, J. M. (2003). A consistency test of the time trade-off. Journal of Health Economics, 22(6), 1037–1052.

Bleichrodt, H., Pinto-Prades, J. L., & Wakker, P. P. (2001). Making descriptive use of prospect theory to improve the prescriptive use of expected utility. Management Science, 47(11), 1498–1514.

Bleichrodt, H., & Quiggin, J. (1997). Characterizing QALYs under a general rank dependent utility model. Journal of Risk and Uncertainty, 15(2), 151–165.

Bleichrodt, H., & Quiggin, J. (2013). Capabilities as menus: A non-welfarist basis for QALY evaluation. Journal of Health Economics, 32(1), 128–137.

Bleichrodt, H., van Rijn, J., & Johannesson, M. (1999). Probability weighting and utility curvature in QALY-based decision making. Journal of Mathematical Psychology, 43(2), 238–260.

Bostic, R., Herrnstein, R. J., & Luce, R. D. (1990). The effect on the preference-reversal phenomenon of using choice indifferences. Journal of Economic Behavior and Organization, 13(2), 193–212.

Brazier, J., Dolan, P., Karampela, K., & Towers, I. (2006). Does the whole equal the sum of the parts? Patient‐assigned utility scores for IBS‐related health states and profiles. Health Economics, 15(6), 546–551.

Broome, J. (1991). Weighing goods. Oxford, U.K.: Blackwell.

Butler, D. J., & Loomes, G. C. (2007). Imprecision as an account of the preference reversal phenomenon. American Economic Review, 97(1), 277–297.

Butler, D., & Loomes, G. (2011). Imprecision as an account of violations of independence and betweenness. Journal of Economic Behavior and Organization, 80(3), 511–522.

Canadian Agency for Drugs and Technologies in Health. (2017). Guidelines for the economic evaluation of health technologies: Canada (4th ed.). Ottawa, ON: Author.

Dolan, P., Gudex, C., Kind, P., & Williams, A. (1995). A social tariff for EuroQol: Results from a UK general population survey. York, U.K.: University of York, Center for Health Economics.

Dolan, P., Gudex, C., Kind, P., & Williams, A. (1996). The time trade-off method: Results from a general population study. Health Economics, 5(2), 141–154.

Drummond, M. F., Sculpher, M. J., Claxton, K., Stoddart, G. L., & Torrance, G. W. (2016). Methods for the economic evaluation of health care programmes (4th ed.). Oxford, U.K.: Oxford University Press.

Edwards, W. (2013). Utility theories: Measurements and applications (Vol. 3). New York, NY: Springer Science & Business Media.

Fanshel, S., & Bush, J. W. (1970). A health-status index and its application to health-services outcomes. Operations Research, 18(6), 1021–1066.

Finnell, S. M. E., Carroll, A. E., & Downs, S. M. (2012). The utility assessment method order influences measurement of parents’ risk attitude. Value in Health, 15(6), 926–932.

Fischer, G. W., Carmon, Z., Ariely, D., & Zauberman, G. (1999). Goal-based construction of preferences: Task goals and the prominence effect. Management Science, 45(8), 1057–1075.

Fischhoff, B. (1982). Debiasing. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases (pp. 422–444). Cambridge, U.K.: Cambridge University Press.

Fischhoff, B. (2010). Judgment and decision making. Wiley Interdisciplinary Reviews: Cognitive Science, 1(5), 724–735.

Fishburn, P. C. (1965). Independence in utility theory with whole product sets. Operations Research, 13(1), 28–45.

Groot, W. (2000). Adaptation and scale of reference bias in self-assessments of quality of life. Journal of Health Economics, 19(3), 403–420.

Hanmer, J., Cherepanov, D., Palta, M., Kaplan, R. M., Feeny, D., & Fryback, D. G. (2016). Health condition impacts in a nationally representative cross-sectional survey vary substantially by preference-based health index. Medical Decision Making, 36(2), 264–274.

Hausman, D. M. (2012). Preference, value, choice, and welfare. Cambridge, U.K.: Cambridge University Press.

Infante, G., Lecouteux, G., & Sugden, R. (2016). Preference purification and the inner rational agent: A critique of the conventional wisdom of behavioural welfare economics. Journal of Economic Methodology, 23(1), 1–25.

Keeney, R. L., & Raiffa, H. (1976). Decision analysis with multiple conflicting objectives. New York, NY: Wiley.

Kuppermann, M., Shiboski, S., Feeny, D., Elkin, E. P., & Washington, A. E. (1997). Can preference scores for discrete states be used to derive preference scores for entire paths of events? Medical Decision Making, 17, 42–55.

Lipman, S. A., Brouwer, W. B. F., & Attema, A. E. (2017). QALYs without bias? Non-parametric correction of time trade-off and standard gamble utilities based on prospect theory. Erasmus University Rotterdam, Rotterdam, The Netherlands.

Lipman, S. A., Brouwer, W. B. F., & Attema, A. E. (2019). The corrective approach: Policy implications of recent developments in QALY measurement based on prospect theory. Value in Health, in press.

Lipscomb, J., Drummond, M., Fryback, D., Gold, M., & Revicki, D. (2009). Retaining, and enhancing, the QALY. Value in Health, 12, S18–S26.

Llewellyn-Thomas, H., Sutherland, H. J., Tibshirani, R., Ciampi, A., Till, J. E., & Boyd, N. F. (1982). The measurement of patients’ values in medicine. Medical Decision Making, 2(4), 449–462.

Loomes, G. (2010). Modeling choice and valuation in decision experiments. Psychological Review, 117(3), 902–924.

Loomes, G., Pinto-Prades, J.-L., Abellán-Perpiñán, J.-M., & Rodríguez-Míguez, E. (2010). Modelling noise and imprecision in individual decisions. Working Paper 10.03. Seville, Spain: Universidad Pablo de Olavide, Department of Economics.

MacKeigan, L. D., O’Brien, B. J., & Oh, P. I. (1999). Holistic versus composite preferences for lifetime treatment sequences for type 2 diabetes. Medical Decision Making, 19(2), 113–121.

McNeil, B. J., Weichselbaum, R., & Pauker, S. G. (1981). Tradeoffs between quality and quantity of life in laryngeal cancer. The New England Journal of Medicine, 305, 982–987.

Miyamoto, J. M., & Eraker, S. A. (1988). A multiplicative model of the utility of survival duration and health quality. Journal of Experimental Psychology: General, 117, 3–20.

Miyamoto, J. M., & Eraker, S. A. (1989). Parametric models of the utility of survival duration: Tests of axioms in a generic utility framework. Organizational Behavior and Human Decision Processes, 44(2), 166–202.

Morrison, G. C., Neilson, A., & Malek, M. (2002). Improving the sensitivity of the time trade-off method: Results of an experiment using chained TTO questions. Health Care Management Science, 5(1), 53–61.Find this resource:

National Institute for Health and Care Excellence. (2013). Guide to the methods of technology appraisal. London, U.K.: Author.Find this resource:

Neumann, P. J., Sanders, G. D., Russell, L. B., Siegel, J. E., & Ganiats, T. G. (2017). Cost-effectiveness in health and medicine. Oxford, U.K.: Oxford University Press.Find this resource:

Payne, J. W., Bettman, J. R., & Schkade, D. A. (1999). Measuring constructed preferences: Towards a building code. Journal of Risk and Uncertainty, 19(1–3), 243–270.Find this resource:

Pinto-Prades, J. L., & Abellán-Perpiñán, J. M. (2012). When normative and descriptive diverge: How to bridge the difference. Social Choice and Welfare, 38(4), 569–584.

Pinto-Prades, J. L., Sánchez-Martínez, F. I., Abellán-Perpiñán, J. M., & Martínez-Pérez, J. E. (2017). Improving scope sensitivity in contingent valuation: Joint and separate evaluation of health states. Health Economics, 26(12), e304–e318.

Pinto-Prades, J. L., Sánchez-Martínez, F. I., Abellán-Perpiñán, J. M., & Martínez-Pérez, J. E. (2018). Reducing preference reversals: The role of preference imprecision and nontransparent methods. Health Economics, 27(8), 1230–1246.

Pliskin, J. S., Shepard, D. S., & Weinstein, M. C. (1980). Utility functions for life years and health status. Operations Research, 28, 206–244.

Richardson, J., Hall, J., & Salkeld, G. (1996). The measurement of utility in multiphase health states. International Journal of Technology Assessment in Health Care, 12, 151–162.

Robinson, A., Dolan, P., & Williams, A. (1997). Valuing health states using VAS and TTO: What lies behind the numbers? Social Science and Medicine, 45, 1289–1297.

Robinson, A., Loomes, G., & Jones-Lee, M. (2001). Visual analog scales, standard gambles, and relative risk aversion. Medical Decision Making, 21(1), 17–27.

Savage, L. J. (1954). The foundations of statistics. New York, NY: Wiley.

Schwartz, A. (1998). Rating scales in context. Medical Decision Making, 18(2), 236.

Stalmeier, P. F. M. (2002). Discrepancies between chained and classic utilities induced by anchoring with occasional adjustments. Medical Decision Making, 22(1), 53–64.

Stalmeier, P. F., & Verheijen, A. L. (2013). Maximal endurable time states and the standard gamble: More preference reversals. European Journal of Health Economics, 14(6), 971–977.

Stalmeier, P., Wakker, P., & Bezembinder, T. (1997). Preference reversals: Violations of unidimensional procedure invariance. Journal of Experimental Psychology: Human Perception and Performance, 25(4), 1196–1205.

Sugden, R. (2017). Do people really want to be nudged towards healthy lifestyles? International Review of Economics, 64(2), 113–123.

Sutherland, H. J., Dunn, W., & Boyd, N. F. (1983). Measurement of values for states of health with linear analog scales. Medical Decision Making, 3(4), 477–487.

Sutherland, H. J., Llewelyn-Thomas, H., Boyd, N. F., & Till, J. E. (1982). Attitudes toward quality of survival. The concept of ‘maximal endurable time’. Medical Decision Making, 2(3), 299–309.

Taylor, M., Chilton, S., Ronaldson, S., Metcalf, H., & Nielsen, J. S. (2017). Comparing increments in utility of health: An individual-based approach. Value in Health, 20(2), 224–229.

Taylor, M. M., & Creelman, C. D. (1967). PEST: Efficient estimates on probability functions. The Journal of the Acoustical Society of America, 41(4), 782–787.

Torrance, G. W. (1976). Social preferences for health states: An empirical evaluation of three measurement techniques. Socio-Economic Planning Sciences, 10, 129–136.

Torrance, G. W. (1986). Measurement of health state utilities for economic appraisal. Journal of Health Economics, 12, 39–53.

Torrance, G. W., Thomas, W. H., & Sackett, D. L. (1972). A utility maximization model for evaluation of health care programs. Health Services Research, 7(2), 118–133.

Tversky, A., & Kahneman, D. (1991). Loss aversion in riskless choice: A reference-dependent model. The Quarterly Journal of Economics, 106(4), 1039–1061.

Tversky, A., & Kahneman, D. (1992). Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4), 297–323.

Tversky, A., Sattath, S., & Slovic, P. (1988). Contingent weighting in judgment and choice. Psychological Review, 95(3), 371–384.

Verhoef, L. C. G., De Haan, A. F. J., & Van Daal, W. A. J. (1994). Risk attitude in gambles with years of life: Empirical support for prospect theory. Medical Decision Making, 14(2), 194–200.

Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior. Princeton, NJ: Princeton University Press.

Wakker, P. (1996). A criticism of healthy-years equivalents. Medical Decision Making, 16(3), 207–214.

Wakker, P., & Deneffe, D. (1996). Eliciting von Neumann-Morgenstern utilities when probabilities are distorted or unknown. Management Science, 42(8), 1131–1150.