Biostatistics in psychiatry (25)

Kappa coefficient: a popular measure of rater agreement
Wan TANG 1*, Jun HU 2, Hui ZHANG 3, Pan WU 4, Hua HE 1,5
1 Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY, United States
2 College of Basic Science and Information Engineering, Yunnan Agricultural University, Kunming, Yunnan Province, China
3 Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN, United States
4 Value Institute, Christiana Care Health System, Newark, DE, United States
5 Center of Excellence for Suicide Prevention, Canandaigua VA Medical Center, Canandaigua, NY, United States
*correspondence: ***********************.edu
A full-text Chinese translation of this article will be available at www.shanghaiarchivesofpsychiatry/cn on March 25, 2015.
Summary: In mental health and psychosocial studies it is often necessary to report on the between-rater agreement of measures used in the study. This paper discusses the concept of agreement, highlighting its fundamental difference from correlation. Several examples demonstrate how to compute the kappa coefficient – a popular statistic for measuring agreement – both by hand and by using statistical software packages such as SAS and SPSS. Real study data are used to illustrate how to use and interpret this coefficient in clinical research and practice. The article concludes with a discussion of the limitations of the coefficient.

Keywords: interrater agreement; kappa coefficient; weighted kappa; correlation

[Shanghai Arch Psychiatry. 2015; 27(1): 62-67. doi: 10.11919/j.issn.1002-0829.215010]
1. Introduction
For most physical illnesses, such as high blood pressure and tuberculosis, definitive diagnoses can be made using medical devices such as a sphygmomanometer for blood pressure or an X-ray for tuberculosis. However, there are no error-free gold-standard physical indicators of mental disorders, so the diagnosis and severity of mental disorders typically depend on the use of instruments (questionnaires) that attempt to measure latent, multi-faceted constructs. For example, psychiatric diagnoses are often based on criteria specified in the fourth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV),[1] published by the American Psychiatric Association. But different clinicians may have different opinions about the presence or absence of the specific symptoms required to determine the presence of a diagnosis, so there is typically no perfect agreement between evaluators. In this situation, statistical methods are needed to address variability in clinicians’ ratings.
Cohen’s kappa is a widely used index for assessing agreement between raters.[2] Although similar in appearance, agreement is a fundamentally different concept from correlation. To illustrate, consider an instrument with six items and suppose that two raters’ ratings of the six items on a single subject are (3,5), (4,6), (5,7), (6,8), (7,9), and (8,10). Although the scores of the two raters are quite different, the Pearson correlation coefficient for the two sets of scores is 1, indicating perfect correlation. The paradox occurs because there is a bias in the scoring that results in a consistent difference of 2 points between the two raters’ scores for all 6 items in the instrument. Thus, although the ratings are perfectly correlated (precision), agreement between the two raters is quite poor. The kappa index, the most popular measure of rater agreement, resolves this problem by assessing both the bias and the precision of the raters’ ratings.
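To make the contrast concrete, the short Python sketch below (our illustration; the article itself works its examples by hand and with SAS and SPSS) computes both the Pearson correlation and an unweighted Cohen’s kappa for the six pairs of ratings above. Treating each score as a nominal category is a simplifying assumption made only for this sketch.

# Illustration of the correlation-versus-agreement paradox for the six items above.
import numpy as np

rater_a = np.array([3, 4, 5, 6, 7, 8])    # first rater's scores on the six items
rater_b = np.array([5, 6, 7, 8, 9, 10])   # second rater's scores (always 2 points higher)

# Pearson correlation: equals 1 because the two sets of scores are perfectly linearly related.
r = np.corrcoef(rater_a, rater_b)[0, 1]

# Unweighted Cohen's kappa, computed from its definition:
# kappa = (observed agreement - chance agreement) / (1 - chance agreement).
labels = np.union1d(rater_a, rater_b)
p_o = np.mean(rater_a == rater_b)                        # observed agreement (0 here: no exact matches)
p_a = np.array([np.mean(rater_a == k) for k in labels])  # marginal proportions for rater A
p_b = np.array([np.mean(rater_b == k) for k in labels])  # marginal proportions for rater B
p_e = np.sum(p_a * p_b)                                  # agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e)

print(f"Pearson r = {r:.3f}, kappa = {kappa:.3f}")       # Pearson r = 1.000, kappa = -0.125

The perfect correlation reflects only precision, while the near-zero (in fact slightly negative) kappa reflects the systematic 2-point bias. An unweighted kappa treats any disagreement as total, so a weighted kappa is usually preferred for ordinal scores like these; the point here is only that correlation and agreement can diverge sharply.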
In addition to its applications to psychiatric diagnosis, the concept of agreement is also widely applied to assess the utility of diagnostic and screening tests. Diagnostic tests provide information about a patient’s condition that clinicians often use when making decisions about the management of patients. Early detection of disease or of important changes in the clinical status of patients often leads to less suffering and quicker recovery, but false negative and false positive screening results can lead to delayed treatment or to inappropriate treatment. Thus, when a new diagnostic or screening test is developed, it is critical to assess its accuracy by comparing test results with those from a gold or reference standard. When assessing such tests, it is incorrect to measure the correlation between the results of the test and the gold standard; the correct procedure is to assess the agreement of the test results with the gold standard.
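As a purely hypothetical illustration (the counts below are invented and do not come from the article), the same kappa calculation can be applied to a 2x2 cross-tabulation of a binary screening test against a gold-standard diagnosis:

# Hypothetical 2x2 table: rows = gold standard (positive/negative), columns = test (positive/negative).
import numpy as np

table = np.array([[40, 10],   # gold-standard positive: 40 test positive, 10 test negative
                  [ 5, 45]])  # gold-standard negative:  5 test positive, 45 test negative

n = table.sum()
p_o = np.trace(table) / n              # observed agreement: proportion on the diagonal (0.85)
row = table.sum(axis=1) / n            # gold-standard marginal proportions
col = table.sum(axis=0) / n            # test marginal proportions
p_e = np.sum(row * col)                # agreement expected by chance (0.50)
kappa = (p_o - p_e) / (1 - p_e)        # (0.85 - 0.50) / (1 - 0.50) = 0.70

print(f"kappa = {kappa:.2f}")

Here kappa corrects the raw 85% agreement for the 50% agreement expected by chance alone, which is why it, rather than a correlation coefficient, is the appropriate summary of how well the test reproduces the gold standard.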