The reliability of subjective well-being measures ☆
Alan B. Krueger a,⁎, David A. Schkade b
a Princeton University, United States
b University of California, San Diego, United States
Received 19 March 2007; received in revised form 9 October 2007; accepted 29 December 2007
Available online 16 January 2008
Abstract
This paper studies the test–retest reliability of a standard self-reported life satisfaction measure and of affect measures collected from a diary method. The sample consists of 229 women who were interviewed on Thursdays, two weeks apart, in Spring 2005. The correlation of net affect (i.e., duration-weighted positive feelings less negative feelings) measured two weeks apart is .64, which is slightly higher than the correlation of life satisfaction (r = .59). Correlations between income, net affect and life satisfaction are presented, and adjusted for attenuation bias due to measurement error. Life satisfaction is found to correlate much more strongly with income than does net affect. Components of affect that are more person-specific are found to have a higher test–retest reliability than components of affect that are more specific to the particular situation. While reliability figures for subjective well-being measures are lower than those typically found for education, income and many other microeconomic variables, they are probably sufficiently high to support much of the research that is currently being undertaken on subjective well-being, particularly in studies where group means are compared (e.g., across activities or demographic groups).
© 2008 Elsevier B.V. All rights reserved.
Keywords: Subjective well-being; Life satisfaction; Net affect; Day Reconstruction Method (DRM)
1. Introduction
Economists are increasingly analyzing data on subjective well-being (SWB). From 2000 to 2006, 157 papers and numerous books were published in the economics literature using data on life satisfaction or subjective well-being, according to a search of EconLit.¹ Data on life satisfaction or happiness have been used as outcome measures in studies of the tradeoff between inflation and unemployment, the effect of cigarette taxes on welfare, the effect of German reunification on well-being, and the effect of lottery winnings on well-being.² In addition, life and work satisfaction measures have appeared as explanatory variables in studies of labor turnover, productivity and health.³ If it could be measured accurately, or even approximately, subjective well-being would be a natural variable for economists to model and understand because utility maximization is a central idea in economics, from either a normative or positive perspective.

Journal of Public Economics 92 (2008) 1833–1845
☆ The authors thank Daniel Kahneman, Norbert Schwarz, Arthur Stone and two anonymous referees for helpful comments and the Hewlett Foundation, the National Institute on Aging, and Princeton University's Woodrow Wilson School for financial support.
⁎ Corresponding author. E-mail address: akrueger@princeton.edu (A.B. Krueger).
¹ Prominent examples are Layard (2005), Blanchflower and Oswald (2004a,b), and Frey and Stutzer (2002).
² See, for example, Di Tella et al. (2003), Gruber and Mullainathan (2002), Frijters et al. (2004), and Gardner and Oswald (2001).
0047-2727/$ - see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.jpubeco.2007.12.015
Here we analyze the test–retest reliability of two types of measures of subjective well-being: a standard life satisfaction question and affective experience measures derived from the Day Reconstruction Method (Kahneman et al., 2004). Although economists have longstanding reservations about the feasibility of interpersonal comparisons of utility that we can only partially address here, another question concerns the persistence of subjective well-being measurements for the same set of individuals over time. Absent dramatic events, overall life satisfaction should not change much from week to week. Likewise, individuals who have similar routines from week to week should experience similar feelings over time. How persistent are individuals' responses to subjective well-being questions? To anticipate our main findings, both measures of subjective well-being (life satisfaction and affective experience) display a serial correlation of about .60 when assessed two weeks apart, which is lower than the reliability ratios typically found for education, income and many other common microeconomic variables (Bound et al., 2001; Angrist and Krueger, 1999). If measurement errors are white noise, a reliability ratio of .60 implies substantial attenuation if the variable is used as an explanatory variable in a regression. Measurement error when subjective well-being is used as a dependent variable would imply a loss of precision in the resulting estimates. Nonetheless, the estimated degree of reliability of subjective well-being data is probably high enough to detect effects when they are present in most applications, especially if samples are large and the data are aggregated across people or activities.
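To make the size of the attenuation concrete, the following worked equation is a sketch under assumptions that are not spelled out in the paragraph above: a bivariate regression $y = \alpha + \beta x^{*} + u$ in which the explanatory variable is observed as $x = x^{*} + e$, with classical error $e$ and reliability ratio $\lambda$. The symbols $x$, $x^{*}$, $u$, $\beta$ and $\lambda$ are introduced here for illustration only.

```latex
% Assumed setup (illustrative): y = \alpha + \beta x^{*} + u, observed x = x^{*} + e,
% e is classical measurement error, \lambda is the reliability ratio of x.
\[
\operatorname{plim}\hat{\beta}_{\mathrm{OLS}}
  = \beta\,\frac{\operatorname{var}(x^{*})}{\operatorname{var}(x^{*})+\operatorname{var}(e)}
  = \lambda\,\beta ,
\qquad
\lambda = 0.60 \;\Rightarrow\; \operatorname{plim}\hat{\beta}_{\mathrm{OLS}} = 0.60\,\beta .
\]
```

Under these assumptions, a reliability ratio of .60 means the estimated slope understates the true slope by roughly 40 percent in large samples. When the mismeasured variable is instead the dependent variable, the slope is not biased by classical error, but its standard error is larger.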
The life satisfaction question that we examine is nearly identical to that used in the World Values Survey, and similar to that used in many other well-being surveys. There is reason to expect, however, that life satisfaction measures such as this may not be as stable from week to week as might be assumed. Rather, the judgments are the result of a complex thought experiment, which is often partially dependent on transient factors (e.g., one's mood at the time; see Schwarz and Strack, 1999).
For measurements of the affective experience of daily life the gold standard is perhaps the Experience Sampling Method (ESM) (also called Ecological Momentary Assessment (EMA)), in which participants are prompted at random intervals to record their current circumstances and feelings (Csikszentmihalyi and Larson, 1987; Stone et al., 1999). This method of measuring affect minimizes the role of memory and interpretation, but it is expensive and difficult to implement in large samples. The Day Reconstruction Method (DRM) is a recent development in the measurement of affective experience, which reduces the cost of obtaining this information. Consequently, we use the DRM, in which participants are requested to think about the preceding day, break it up into episodes, and describe each episode by selecting from several menus (Kahneman et al., 2004). The DRM involves memory, but is designed to increase the accuracy of emotional recall by inducing retrieval of the specifics of successive episodes (Robinson and Clore, 2002; Belli, 1998). Evidence that the two methods can be expected to yield similar results was presented earlier for subpopulation averages (Kahneman et al., 2004).
A critical advantage of the DRM is that it provides data on time use, a valuable source of information in its own right, which has rarely been combined with the study of subjective well-being.
In this paper we report reliability measures for a sample of 229 employed women who each filled out a DRM questionnaire for two Wednesdays, two weeks apart, in 2005. We compare the reliability estimates to those of global well-being measures more typical in the literature, and we decompose the reliability of duration-weighted net affect into a component due to the similarity of activities across days and other factors. We also provide an application using the reliability estimates to correct observed correlations between self-reported well-being and other variables (e.g., income) for attenuation. We conclude with a discussion of the implications of measurement error for DRM studies and for well-being research more generally.
1.1. What is reliability and why should we care?
Consider an observed variable, $y$, which is a noisy measure of the variable of interest, $y^{*}$. We can write $y_i = y_i^{*} + e_i$, where $y_i$ is the observed value for individual $i$, $y_i^{*}$ is the "correct" value, and $e_i$ is the error term. Under the "classical measurement error" assumptions, $e_i$ is a white noise disturbance that is uncorrelated with $y_i^{*}$ and homoskedastic. Classical measurement error will lead correlations between $y$ and other variables to be attenuated toward 0 in large samples.⁴ If we can measure $y_i$ at two points in time, and if the measurement errors are independent and have a constant variance over time, then the correlation between the two measures provides an estimate of the ratio of the variance in the signal to the total variance in $y$. We thus define the reliability ratio, $r$, as $r = \operatorname{corr}(y_i^{1}, y_i^{2})$, where the superscripts indicate the measurement taken in periods 1 and 2. Under the assumptions stated,

$$\operatorname{plim}\, r = \frac{\operatorname{var}(y^{*})}{\operatorname{var}(y^{*}) + \operatorname{var}(e)}.$$

In addition to summarizing the extent of random noise in subjective well-being reports, the signal-to-total variance ratio is of interest because, in the limit, it equals the proportional bias that arises when SWB is an explanatory variable in a bivariate regression. Furthermore, as we explain below, correlations between SWB and other variables are attenuated by random measurement error in SWB. An important application of SWB data involves estimating the correlations among life satisfaction, affect and other variables such as income (e.g., Argyle, 1999). We can use the reliability ratio to correct those correlations for attenuation, which would mean that many reported relationships are stronger than previously thought.

³ See, for example, Freeman (1978), Clark and Georgellis (2004), and Patterson et al. (2004).
⁴ If $y$ is of limited range (e.g., a binary variable), then $e$ will necessarily be correlated with $y^{*}$. We ignore this issue for the time being.
Of course, if the measurement error is not classical, the test–retest correlation can under- or over-state the signal-to-total variance ratio, depending on the nature of the deviation from classical measurement error. With only two reports of y, and without knowledge of y⁎, it is not possible to assess the plausibility of the classical measurement error assumptions. If the errors in measurement are positively correlated over time, then the test–retest correlation will over-state the reliability of the data. Nevertheless, the test–retest correlation is a convenient starting point for summarizing the reliability of subjective well-being data.
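The classical case described in this section can be checked numerically. The following is a minimal simulation sketch, not an analysis of the study's data: it assumes normally distributed true values and errors, and the variances, sample size, seed, and the second variable `x` are all illustrative choices. It shows that the test–retest correlation recovers the signal-to-total variance ratio, and that dividing an observed correlation by the square root of the reliability ratio undoes the attenuation when only one of the two variables is mismeasured.

```python
import numpy as np

# Illustrative assumptions (not values from the paper): signal variance 0.6 and
# error variance 0.4, so the true reliability ratio is 0.6.
VAR_SIGNAL, VAR_ERROR, N = 0.6, 0.4, 200_000

rng = np.random.default_rng(0)
y_star = rng.normal(0.0, np.sqrt(VAR_SIGNAL), N)       # "correct" value y*
y1 = y_star + rng.normal(0.0, np.sqrt(VAR_ERROR), N)   # noisy report, period 1
y2 = y_star + rng.normal(0.0, np.sqrt(VAR_ERROR), N)   # noisy report, period 2

# Test-retest correlation: estimates var(y*) / (var(y*) + var(e)).
reliability = np.corrcoef(y1, y2)[0, 1]

# A hypothetical second variable x, related to y* and measured without error.
x = y_star + rng.normal(0.0, 1.0, N)

corr_true = np.corrcoef(x, y_star)[0, 1]   # correlation with the true value
corr_observed = np.corrcoef(x, y1)[0, 1]   # attenuated toward 0
# Correction for attenuation when only y is mismeasured: divide by sqrt(r).
corr_corrected = corr_observed / np.sqrt(reliability)

print(f"reliability ratio ~ {reliability:.3f}")
print(f"true {corr_true:.3f}, observed {corr_observed:.3f}, "
      f"corrected {corr_corrected:.3f}")
```

If both variables were measured with error, the analogous correction would divide by the square root of the product of the two reliability ratios. If the errors in the two reports were positively correlated, `reliability` would overstate the true ratio, as the paragraph above warns.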
1.2. Related literature
There is a vast empirical literature on subjective well-being (see Kahneman et al., 1999 for a survey). Subjective well-being is most commonly measured by asking people a single question, such as, "All things considered, how satisfied are you with your life as a whole these days?" or "Taken all together, would you say that you are very happy, pretty happy, or not too happy?" Such questions elicit a global evaluation of one's life. Surveys conducted in many countries indicate that, on average, large increases in per capita national income have had little effect on reported global judgments of life satisfaction or happiness over the last four decades. Although reported life satisfaction and household income are positively correlated in a cross section of people at a given time, increases in income have been found to have mainly a transitory effect on individuals' reported life satisfaction (Easterlin, 1995).⁵ Moreover, the correlation between income and subjective well-being is notably weaker when a measure of experienced happiness is used instead of life satisfaction (Kahneman et al., 2006). Of course, such low correlations could be partially due to attenuation, if measurement error is high.
Table 1 summarizes past estimates of the reliability of SWB measures. Single-item measures of SWB have been found to have moderate reliabilities, usually between .40 and .66, even when asked twice in the same session 1 hour apart (Andrews and Whithey, 1976). Kammann and Flett (1983) found that single-item well-being questions under the instructions to consider "the past few weeks" or "these days" had reliabilities of .50 to .55 when asked within the same day. Interestingly, the only study we are aware of that looked at the reliability of an ESM measure of duration-weighted happiness found a correlation on the upper end of the range found for single-item global well-being measures (Steptoe et al., 2005). Overall, there has been surprisingly little attention paid to reliability, despite the wide use of these measures.
The Satisfaction with Life Scale (SWLS, Diener et al., 1985) is another commonly used global satisfaction measure. In contrast to the single-question measures, it consists of the average of five related items, each of which is rated on a 7-point scale from Strongly Disagree (1) to Strongly Agree (7). The items are: "In most ways my life is close to my ideal"; "The conditions of my life are excellent"; "I am satisfied with my life"; "So far I have gotten the important things I want in life"; and "If I could live my life over, I would change almost nothing". A key reason that the SWLS has proven more reliable than single-item questions (see Table 1) is that, because it aggregates multiple items, it benefits from error reduction through aggregation. Eid and Diener (2004) used a structural model to estimate reliability for a sample of 249 students, measured three times with four weeks between successive measurements. After controlling for the influence of situation-specific factors, they estimated that the imputed stability for life satisfaction was very high, around .90.
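The error reduction through aggregation mentioned above can be illustrated with the Spearman–Brown formula, a standard psychometric result that is not used in the paper itself. The sketch below assumes k parallel items with equal single-item reliability and independent errors; the function name and the example values are illustrative and are not estimates for the SWLS.

```python
def spearman_brown(single_item_reliability: float, k: int) -> float:
    """Reliability of the average (or sum) of k parallel items.

    Assumes every item has the same reliability and that item errors are
    independent; both are simplifying assumptions for illustration.
    """
    r = single_item_reliability
    return k * r / (1 + (k - 1) * r)


# Hypothetical example: five items, each with reliability .50.
five_item = spearman_brown(0.50, 5)
print(round(five_item, 2))  # 0.83
```

Under these assumptions, averaging five items that each have a reliability of .50 yields a composite reliability of about .83, which is in the neighborhood of several of the SWLS figures in Table 1. Real items are unlikely to be exactly parallel, so this is only a rough guide.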
⁵ See Deaton (2007) for a careful study of the relationship between average subjective well-being across countries and the logarithm of GDP per capita. Deaton finds a positive effect of the level of GDP, but GDP growth has a negative association with SWB.
Table 1
Estimates of reliability for well-being measures

Study                           Test–retest correlation   Temporal interval   Variable
Single-item measures
  Andrews and Whithey (1976)    .40–.66                   1 hour              Life satisfaction
  Kammann and Flett (1983)      .50–.55                   Same day            Overall happiness, satisfaction
Multiple-item measures (a)
  Alfonso and Allison (1992)    .83                       2 weeks             SWLS
  Pavot et al. (1991)           .84                       1 month             SWLS
  Blais et al. (1989)           .64                       2 months            SWLS
  Diener et al. (1985)          .82                       2 months            SWLS
  Yardley and Rice (1991)       .50                       10 weeks            SWLS
  Magnus et al. (1992)          .54                       4 years             SWLS
ESM
  Steptoe et al. (2005)         .65                       Weekend–weekday     Experienced happiness

(a) Note: From Pavot and Diener (1993), Table 2.
One reason for the modest reliability of subjective well-being measures compared with education and income, which typically have reliability ratios of around .90, could be the susceptibility of SWB questions to transient mood effects. For example, researchers have documented mood changes due to such subtle events as finding a dime before filling out a questionnaire, the current weather, or question order, which in turn influence reported life satisfaction (e.g., Schwarz, 1987). Eid and Diener (2004) used a structural model, which attempted to separate situational variability from random error and basic stability, and found that anywhere from 4% to 25% of the variance in various affect and satisfaction measures was accounted for by situation-specific factors. In an earlier study, Ferring et al. (1996) estimated the size of transient factors as between 12% and 34% of total variance. Since the experienced affect measure produced by the DRM is focused on reconstructing a specific event and the affect actually experienced during it, there is at least the possibility that such measures will be less vulnerable to current mood at the time of the interview.
We might expect DRM measures to be less reliable over time than life satisfaction, however, because a person's activities change from day to day, and affect is associated with activities. At the same time, DRM measures are averages of multiple responses, while global life satisfaction or happiness is often assessed with just one question. If ESM is any guide, the DRM may be at least as reliable as reported overall life satisfaction.
2. Method
We evaluate the test–retest reliability of the DRM by having the same respondents complete a DRM questionnaire two weeks apart regarding the same day of the week (Wednesday). The questionnaire, which is available from the authors on request, also contained standard global life satisfaction measures. The resulting data provide information for the same sample about the relative stability of the DRM compared to the types of global life satisfaction questions used in most well-being research.
For comparability with some previous studies, the respondents (n = 229) were randomly selected from among the women on the driver's license list in Travis County, Texas and screened for employment and age between 18 and 60. Respondents were paid $50 upon completing the first questionnaire and an additional $100 upon completing the second one, for a total of $150.⁶ The interview dates were two Thursdays, March 31, 2005 and April 14, 2005. Following the DRM procedure, participants reported on the previous day. Completion times for the self-administered instrument ranged from 45 to 75 min. The ethnic composition of the sample was 67% white (non-Hispanic), 7% African American, 21% Hispanic, and 5% other. Average age was 42.8 years. Median household income category was $40,000–$50,000. It is possible that the offer of a payment affected individuals' moods, but it is important to note that compensation was not received until after the interview was completed.

⁶ A total of 241 respondents completed the first questionnaire. Of these, 10 did not show up for the second session and 2 showed up but failed to follow the questionnaire instructions sufficiently. There was no significant difference between the 229 who completed both questionnaires and the 12 dropouts on any demographic variable, nor on life satisfaction or mood in data from the first session (which they all completed). While it is possible that the greater payment in the second session could have influenced well-being, the payment came after the questionnaires were completed, and we found no significant differences in the level of satisfaction or average affect between sessions.
The DRM protocol described by Kahneman et al. (2004) was followed. Groups of participants were invited to a central location for a session on Thursday evening, where they answered a series of questions contained in four packets. The first packet included general satisfaction and demographic questions. Next, the respondents were asked to construct a diary of the previous day (Wednesday) as a series of episodes, noting the content and the beginning and end time of each. In the third packet, they were asked for a detailed description of every episode as explained below. The fourth packet contained some general attitude and demographic questions. The average number of episodes a respondent described for the day was somewhat higher in the second session (14.8 vs 13.2, p < .001, by a paired t-test) although the total time covered by the episodes was no different (16.8 vs 16.7 h, p > .20, by a paired t-test). These figures compare to the 14.1 episodes and 15.4 h reported in Kahneman et al. (2004).
The first few questions in the survey were global SWB questions. First was the overall life satisfaction question, "Taking all things together, how satisfied are you with your life as a whole these days? Are you very satisfied, satisfied, not very satisfied, not at all satisfied?" Next, similar satisfaction questions were asked for "your life at home" and "your present job". Two global mood questions followed, for home and for work. The question posed was, "When you are at home, what percentage of the time are you in a bad mood ____%, a little low or irritable ____%, in a mildly pleasant mood ____%, in a very good mood ____%". The last two response categories were added together to obtain the percentage of time in a good mood. Net mood was computed by subtracting the sum of the first two response categories from the sum of the last two. The same procedure was applied to the work mood question.
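The arithmetic behind the good-mood and net-mood scores described above can be sketched as follows. The function name, argument names, and the example percentages are hypothetical; only the two formulas come from the text.

```python
def mood_summary(bad, low, mild, very_good):
    """Summarize the four mood percentages reported for home (or for work).

    bad, low:        % of time in a bad mood / a little low or irritable
    mild, very_good: % of time in a mildly pleasant / very good mood
    """
    good_mood = mild + very_good        # last two categories added together
    net_mood = good_mood - (bad + low)  # good categories minus bad categories
    return good_mood, net_mood


# Hypothetical respondent whose four answers sum to 100%.
good, net = mood_summary(bad=5, low=15, mild=50, very_good=30)
print(good, net)  # 80 60
```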
All respondents received the packets in the same order. The DRM is unlikely to be contaminated by the life satisfaction question coming first, because it involves a very detailed procedure which is carefully grounded in the actual experiences people had. On the other hand, there could be a potentially serious problem with having the life satisfaction question after the DRM, since it is a thought experiment and more susceptible to transient influences and order effects. In addition, the overall life satisfaction question usually comes before domain-specific questions in well-being surveys, so we are matching the literature on this point as well.
The affect measures derived from the DRM are combinations of the duration-weighted affective adjectives that respondents rated for each episode. Net affect was computed by subtracting the average of negative emotions (encompassing tense/stressed, depressed/blue and angry/hostile), denoted negative affect (NA), from the average of positive emotions (encompassing happy, affectionate/friendly and calm/relaxed), denoted positive affect (PA).⁷ Difmax is the duration-weighted average of happy less the maximum of tense/stressed, depressed/blue, angry/hostile. The U-index is closely related to Difmax, and equals one when Difmax < 0 and equals 0 otherwise; that is, at the episode level, the U-index equals one if the most intense feeling is a negative one. We duration weight the episode-level data, so the person-level U-index measures the proportion of time that an individual spends in an unpleasant state. Difmax and the U-index reflect the intuition that an episode can be aversive if only one of the negative feelings is intense (Kahneman and Krueger, 2006).
3. Results
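The person-level measures defined above can be sketched in a few lines. This is an illustration of the definitions as stated in the text, not the authors' code: the `Episode` class, the field names, the assumed 0–6 rating scale, and the example day are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    hours: float          # episode duration, used as the weight
    happy: float          # ratings below are assumed to be on a 0-6 scale
    affectionate: float
    calm: float
    tense: float
    depressed: float
    angry: float


def person_level_measures(episodes):
    """Duration-weighted PA, NA, net affect, Difmax and U-index for one day."""
    total_hours = sum(e.hours for e in episodes)

    def weighted(score):
        # Duration-weighted average of an episode-level score.
        return sum(e.hours * score(e) for e in episodes) / total_hours

    def difmax(e):
        # Happy less the strongest of the three negative feelings.
        return e.happy - max(e.tense, e.depressed, e.angry)

    pa = weighted(lambda e: (e.happy + e.affectionate + e.calm) / 3)
    na = weighted(lambda e: (e.tense + e.depressed + e.angry) / 3)
    return {
        "PA": pa,
        "NA": na,
        "net_affect": pa - na,
        "difmax": weighted(difmax),
        # Share of time spent in episodes where Difmax is negative.
        "u_index": weighted(lambda e: 1.0 if difmax(e) < 0 else 0.0),
    }


# Hypothetical day with two episodes: a pleasant 2-hour one and a tense 1-hour one.
day = [
    Episode(hours=2.0, happy=4, affectionate=3, calm=5, tense=1, depressed=0, angry=0),
    Episode(hours=1.0, happy=1, affectionate=2, calm=1, tense=4, depressed=1, angry=2),
]
measures = person_level_measures(day)
print(measures)
```

In this made-up example the second episode has a negative Difmax, so the U-index is one third: the respondent spent one of three reported hours in an unpleasant state.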
Table 2 presents the correlations between various measures for the same person in the first and second sessions, as well as 95% confidence intervals. We focus first on overall measures of affective experience. Perhaps the most surprising finding is that the reliabilities of Net Affect (r = .64) and Difmax (r = .60) are at least as high as that for life satisfaction (r = .59). Satisfaction with domains of life (work and home) is more reliable than satisfaction with life overall.⁸ The corresponding home and work mood measures are also more reliable than life satisfaction. Another notable feature of the results is that positive affect appears to be somewhat more reliable than negative affect.
The extent to which a person's rating of a particular adjective over different episodes of the day represents personal traits or is influenced by the variability in situations is likely related to the reliability of that adjective. If a given person tends to feel the same way most of the time (a "happy" person or a "depressed" person) regardless of the situation, then this adjective might be expected to have greater reliability across the two sessions, since the activities the person engages in on the two days vary. To crudely gauge the extent to which particular adjectives are person-bound or situation-bound, for each adjective we pooled the two sessions and computed the variance of the duration-weighted personal averages across people and the average variance within each person's days across episodes, and then took the ratio of the between-people to within-person variances. A high ratio would indicate that an adjective is relatively constant for a person (more of an individual difference like a trait) and a low ratio would indicate that an adjective is determined more by the situation than by who the person is. Results are shown in Table 3.

⁷ Frustrated was excluded from negative affect for comparability with our other studies.
⁸ See Kristensen and Westergaard-Nielsen (2006) for a study of the reliability of self-reported job satisfaction in six European Union countries. Using a 10-point scale job satisfaction question that was administered twice in the same survey, they find that 80% of workers classified themselves identically or within one point, and differences in classifications were symmetric around zero.

Table 2
Correlations between selected measures at period 1 and period 2

                                         95% confidence interval
Measure                      Observed    Lower      Upper
Global measures
  Life satisfaction          .59         .49        .67
  Home satisfaction          .74         .68        .80
  Work satisfaction          .68         .61        .75
  Home net mood              .70         .63        .76
  Work net mood              .68         .61        .75
Experience measures
  Net affect                 .64         .56        .71
  Difmax                     .60         .51        .68
  U-index                    .50         .40        .59
Positive affect
  Happy                      .62         .54        .70
  Affectionate/friendly      .68         .61        .75
  Calm/relaxed               .56         .46        .64
  PA                         .68         .61        .75
Negative affect
  Tense/stressed             .54         .44        .62
  Depressed/blue             .60         .51        .68
  Angry/hostile              .54         .44        .63
  Frustrated                 .48         .37        .57
  NA                         .60         .51        .68
Other affect adjectives
  Impatient for it to end    .56         .47        .65
  Competent/doing well       .64         .55        .71
  Interested/focused         .57         .47        .65
  Tired                      .65         .56        .72
Demographics
  Household income           .96         .95        .97
  Education (years)          .98         .98        .99
  Age                        1.00        1.00       1.00

Note: confidence intervals for the correlations are not symmetric because they are based on the non-linear Fisher's z transformation (z = .5[ln(1 + r) − ln(1 − r)]), which is normally distributed and used for significance testing.
Sample sizes are 228 or 229, except for age, which is 223 due to missing data.
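The asymmetric confidence intervals reported for the test–retest correlations in Table 2 follow from the Fisher z transformation given in the table note. The sketch below shows one common way to build such an interval. The standard error 1/sqrt(n − 3) and the 1.96 critical value are standard textbook choices assumed here, not details stated in the paper, and the function name is hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation via Fisher's z.

    Assumes the usual large-sample standard error 1 / sqrt(n - 3).
    """
    z = 0.5 * (math.log(1 + r) - math.log(1 - r))  # Fisher's z transformation
    se = 1.0 / math.sqrt(n - 3)
    lower_z, upper_z = z - z_crit * se, z + z_crit * se
    # tanh is the inverse of the z transformation, mapping back to the r scale.
    return math.tanh(lower_z), math.tanh(upper_z)


# Life satisfaction: r = .59 with n = 229.
lower, upper = fisher_ci(0.59, 229)
print(f"{lower:.3f} to {upper:.3f}")
```

Because the interval is symmetric on the z scale but not on the r scale, the lower bound lies farther from .59 than the upper bound does. The result should be close to the interval reported for life satisfaction, although small differences can arise from rounding of the published correlation.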
Quite plausibly, feeling depressed appears to be a more trait-like descriptor, while feeling tense/stressed or impatient for an episode to end are highly situational. Interestingly, we found a correlation of .41 between the variance ratio and the reliability ratios shown in Table 3, which indicates moderate support for the hypothesis of greater reliability for trait-like emotions.⁹
⁹ We also computed the ratios for the DRM sample in Kahneman et al. (2004). The two samples produced very similar sets of ratios: for the 8 adjectives in common between the two samples the correlation of the ratios was .89.
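The between-people to within-person variance ratio used to classify adjectives as trait-like or situational can be sketched as follows. The text does not say whether the within-person variance is duration-weighted or which degrees-of-freedom correction is used, so the unweighted sample variance (ddof=1) below is an assumption, as are the function name and the toy data.

```python
import numpy as np


def variance_ratio(ratings, hours, person_ids):
    """Between-people variance of duration-weighted personal means divided by
    the average within-person variance across episodes, for one adjective.

    Assumes every person has at least two episodes.
    """
    ratings = np.asarray(ratings, dtype=float)
    hours = np.asarray(hours, dtype=float)
    person_ids = np.asarray(person_ids)

    person_means, within_vars = [], []
    for pid in np.unique(person_ids):
        mask = person_ids == pid
        # Duration-weighted personal average of the adjective.
        person_means.append(np.average(ratings[mask], weights=hours[mask]))
        # Unweighted variance across this person's episodes (an assumption).
        within_vars.append(np.var(ratings[mask], ddof=1))

    between = np.var(person_means, ddof=1)
    within = np.mean(within_vars)
    return between / within


# Toy data: two people, three one-hour episodes each.
ratio = variance_ratio(
    ratings=[4, 4, 5, 1, 2, 1],
    hours=[1, 1, 1, 1, 1, 1],
    person_ids=["A", "A", "A", "B", "B", "B"],
)
print(ratio)
```

In the toy data the two people differ a great deal from each other and vary little across their own episodes, so the ratio is high, which is the pattern the text associates with a trait-like adjective.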