Journal article details
BMC Medical Research Methodology
A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples
Kilem L. Gwet [1], Danny Wedding [3], Tinakon Wongpakaran [2], Nahathai Wongpakaran [2]
[1] Statistical Consultant, Advanced Analytics, LLC, PO Box 2696, Gaithersburg, Maryland, USA; [2] Department of Psychiatry, Faculty of Medicine, Chiang Mai University, Chiang Mai 50200, Thailand; [3] California School of Professional Psychology, Alliant International University, San Francisco, California, USA
Keywords: Personality disorders; Gwet’s AC1; Cohen’s Kappa; Coefficients; Inter-rater reliability
DOI: 10.1186/1471-2288-13-61
Received: 2012-08-31; Accepted: 2013-04-26; Published: 2013
【 Abstract 】

Background

Rater agreement is important in clinical research, and Cohen’s Kappa is a widely used method for assessing inter-rater reliability; however, there are well-documented statistical problems associated with the measure. In order to assess its utility, we evaluated it against Gwet’s AC1 and compared the results.

Methods

This study was carried out with 67 patients (56% male) aged 18 to 67 years (mean ± SD age, 44.13 ± 12.68 years). Nine raters (seven psychiatrists, a psychiatry resident, and a social worker) served as interviewers for either the first or the second interview; the two interviews, held 4 to 6 weeks apart, were conducted to establish a personality disorder (PD) diagnosis using DSM-IV criteria. Cohen’s Kappa and Gwet’s AC1 were used, and the level of agreement between raters was assessed in terms of a simple categorical diagnosis (i.e., the presence or absence of a disorder). The data were also compared with a previous analysis in order to evaluate the effects of trait prevalence.
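For readers who want to reproduce this kind of analysis, below is a minimal sketch (not the authors’ code) of both coefficients for the two-rater, binary (present/absent) design described above. The function names cohen_kappa, gwet_ac1, and _marginals are our own, and no variance estimates or multi-rater extensions are included.

```python
# Minimal sketch: Cohen's Kappa and Gwet's AC1 for two raters and a
# binary diagnosis (0 = absent, 1 = present). Not the authors' code.

def _marginals(ratings_a, ratings_b):
    """Observed agreement p_o and each rater's marginal P(present)."""
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    return p_o, sum(ratings_a) / n, sum(ratings_b) / n

def cohen_kappa(ratings_a, ratings_b):
    """Chance agreement p_e from the product of the two raters' marginals."""
    p_o, pa, pb = _marginals(ratings_a, ratings_b)
    p_e = pa * pb + (1 - pa) * (1 - pb)
    return (p_o - p_e) / (1 - p_e)   # undefined (0/0) when p_e == 1

def gwet_ac1(ratings_a, ratings_b):
    """Chance agreement p_e from the average marginal pi = (pa + pb) / 2."""
    p_o, pa, pb = _marginals(ratings_a, ratings_b)
    pi = (pa + pb) / 2
    p_e = 2 * pi * (1 - pi)          # binary case of (1/(q-1)) * sum_k pi_k(1 - pi_k)
    return (p_o - p_e) / (1 - p_e)
```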

Results

Gwet’s AC1 yielded higher inter-rater reliability coefficients for all of the PD criteria, ranging from .752 to 1.000, whereas Cohen’s Kappa ranged from 0 to 1.00. Cohen’s Kappa values were high and close to the percentage of agreement when the prevalence was high, whereas Gwet’s AC1 values changed little with prevalence and remained close to the percentage of agreement. For example, the Schizoid sample yielded a mean Cohen’s Kappa of .726 and a mean Gwet’s AC1 of .853, values that fall into different levels of agreement according to the criteria developed by Landis and Koch, and by Altman and Fleiss.
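As a concrete, hypothetical illustration of this prevalence effect (the figures below are illustrative and not taken from the study), applying the cohen_kappa and gwet_ac1 sketches above to a highly skewed binary diagnosis shows Kappa falling well below the raw agreement while AC1 stays close to it:

```python
# Hypothetical data: 86 patients rated "present" by both raters, 4 rated
# "absent" by both, and 10 disagreements -- raw agreement 0.90, but the
# trait is "present" in about 91% of each rater's judgements.
rater_a = [1] * 86 + [0] * 4 + [1] * 5 + [0] * 5
rater_b = [1] * 86 + [0] * 4 + [0] * 5 + [1] * 5

print(cohen_kappa(rater_a, rater_b))  # ~0.39: dragged down by the skewed marginals
print(gwet_ac1(rater_a, rater_b))     # ~0.88: stays near the 0.90 raw agreement
```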

Conclusions

Based on the different formulae used to calculate the level of chance-corrected agreement, Gwet’s AC1 was shown to provide a more stable inter-rater reliability coefficient than Cohen’s Kappa. It was also found to be less affected by prevalence and marginal probability than Cohen’s Kappa, and should therefore be considered for use in inter-rater reliability analysis.
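For reference, both statistics share the same chance-corrected form and differ only in how the probability of chance agreement p_e is estimated; the two-rater expressions below follow Gwet’s formulation [9,10] and are included here as a summary rather than a quotation from the paper:

\[
\text{coefficient} \;=\; \frac{p_o - p_e}{1 - p_e},
\qquad
p_e^{\mathrm{Kappa}} = \sum_{k=1}^{q} p_{k\cdot}\, p_{\cdot k},
\qquad
p_e^{\mathrm{AC1}} = \frac{1}{q-1}\sum_{k=1}^{q} \pi_k\,(1-\pi_k),
\quad
\pi_k = \frac{p_{k\cdot}+p_{\cdot k}}{2},
\]

where \(p_o\) is the observed proportion of agreement, \(p_{k\cdot}\) and \(p_{\cdot k}\) are the two raters’ marginal proportions for category \(k\), and \(q\) is the number of categories. When one category dominates, \(p_e^{\mathrm{Kappa}}\) approaches 1 and the denominator \(1-p_e\) shrinks, which is what makes Kappa unstable under extreme prevalence, whereas \(p_e^{\mathrm{AC1}}\) approaches 0, keeping AC1 close to \(p_o\).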

【 License 】

   
2013 Wongpakaran et al.; licensee BioMed Central Ltd.

【 References 】
  • [1]First MB, Gibbon M, Spitzer RL, Williams JBW, Benjamin LS: Structured Clinical Interview for DSM-IV Axis II Personality Disorder (SCID-II). Washington, DC: American Psychiatric Press; 1997.
  • [2]Cohen J: A coefficient of agreement for nominal scales. Educ Psychol Meas 1960, 20:37-46.
  • [3]Cohen J: Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychol Bull 1968, 70:213-220.
  • [4]Wongpakaran T, Wongpakaran N, Bookkamana P, Boonyanaruthee V, Pinyopornpanish M, Likhitsathian S, Suttajit S, Srisutadsanavong U: Interrater reliability of Thai version of the Structured Clinical Interview for DSM-IV Axis II Personality Disorders (T-SCID II). J Med Assoc Thai 2012, 95:264-269.
  • [5]Dreessen L, Arntz A: Short-interval test-retest interrater reliability of the Structured Clinical Interview for DSM-III-R personality disorders (SCID-II) in outpatients. J Pers Disord 1998, 12:138-148.
  • [6]Weertman A, Arntz A, Dreessen L, van Velzen C, Vertommen S: Short-interval test-retest interrater reliability of the Dutch version of the Structured Clinical Interview for DSM-IV personality disorders (SCID-II). J Pers Disord 2003, 17:562-567.
  • [7]Cicchetti DV, Feinstein AR: High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990, 43:551-558.
  • [8]Di Eugenio B, Glass M: The Kappa Statistic: A Second Look. Comput Linguist 2004, 30:95-101.
  • [9]Gwet KL: Handbook of Inter-Rater Reliability. The Definitive Guide to Measuring the Extent of Agreement Among Raters. 2nd edition. Gaithersburg, MD 20886–2696, USA: Advanced Analytics, LLC; 2010.
  • [10]Gwet KL: Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol 2008, 61:29-48.
  • [11]Kittirattanapaiboon P, Khamwongpin M: The Validity of the Mini International Neuropsychiatric Interview (M.I.N.I.) - Thai Version. Journal of Mental Health of Thailand 2005, 13:126-136.
  • [12]Gwet K: Inter-Rater Reliability: Dependency on Trait Prevalence and Marginal Homogeneity. http://www.agreestat.com/research_papers/inter_rater_reliability_dependency.pdf
  • [13]Gwet K: Kappa is not satisfactory for assessing the extent of agreement between raters. http://www.google.ca/url?sa=t&rct=j&q=kappa%20statistic%20is%20not%
  • [14]Day FC, Schriger DL, Annals Of Emergency Medicine Journal Club: A consideration of the measurement and reporting of interrater reliability: answers to the July 2009 Journal Club questions. Ann Emerg Med 2009, 54:843-853.
  • [15]Arntz A, van Beijsterveldt B, Hoekstra R, Hofman A, Eussen M, Sallaerts S: The interrater reliability of a Dutch version of the Structured Clinical Interview for DSM-III-R Personality Disorders. Acta Psychiatr Scand 1992, 85:394-400.
  • [16]Lobbestael J, Leurgans M, Arntz A: Inter-rater reliability of the Structured Clinical Interview for DSM-IV Axis I Disorders (SCID I) and Axis II Disorders (SCID II). Clin Psychol Psychother 2011, 18:75-79.
  • [17]Kongerslev M, Moran P, Bo S, Simonsen E: Screening for personality disorder in incarcerated adolescent boys: preliminary validation of an adolescent version of the standardised assessment of personality - abbreviated scale (SAPAS-AV). BMC Psychiatry 2012, 12:94.
  • [18]Chan YH: Biostatistics 104: correlational analysis. Singapore Med J 2003, 44:614-619.
  • [19]Hartling L, Bond K, Santaguida PL, Viswanathan M, Dryden DM: Testing a tool for the classification of study designs in systematic reviews of interventions and exposures showed moderate reliability and low accuracy. J Clin Epidemiol 2011, 64:861-871.
  • [20]Hernaez R, Lazo M, Bonekamp S, Kamel I, Brancati FL, Guallar E, Clark JM: Diagnostic accuracy and reliability of ultrasonography for the detection of fatty liver: a meta-analysis. Hepatology 2011, 54:1082-1090.
  • [21]Sheehan DV, Sheehan KH, Shytle RD, Janavs J, Bannon Y, Rogers JE, Milo KM, Stock SL, Wilkinson B: Reliability and validity of the Mini International Neuropsychiatric Interview for Children and Adolescents (MINI-KID). J Clin Psychiatry 2010, 71:313-326.
  • [22]Ingenhoven TJ, Duivenvoorden HJ, Brogtrop J, Lindenborn A, van den Brink W, Passchier J: Interrater reliability for Kernberg's structural interview for assessing personality organization. J Pers Disord 2009, 23:528-534.
  • [23]Øiesvold T, Nivison M, Hansen V, Sørgaard KW, Østensen L, Skre I: Classification of bipolar disorder in psychiatric hospital. A prospective cohort study. BMC Psychiatry 2012, 12:13.
  • [24]Clement S, Brohan E, Jeffery D, Henderson C, Hatch SL, Thornicroft G: Development and psychometric properties of the Barriers to Access to Care Evaluation scale (BACE) related to people with mental ill health. BMC Psychiatry 2012, 12:36.
  • [25]McCoul ED, Smith TL, Mace JC, Anand VK, Senior BA, Hwang PH, Stankiewicz JA, Tabaee A: Interrater agreement of nasal endoscopy in patients with a prior history of endoscopic sinus surgery. Int Forum Allergy Rhinol 2012, 2:453-459.
  • [26]Ansari NN, Naghdi S, Forogh B, Hasson S, Atashband M, Lashgari E: Development of the Persian version of the Modified Modified Ashworth Scale: translation, adaptation, and examination of interrater and intrarater reliability in patients with poststroke elbow flexor spasticity. Disabil Rehabil 2012, 34:1843-1847.
  • [27]Gisev N, Bell JS, Chen TF: Interrater agreement and interrater reliability: Key concepts, approaches, and applications. Res Social Adm Pharm, in press.
  • [28]Petzold A, Altintas A, Andreoni L, Bartos A, Berthele A, Blankenstein MA, Buee L, Castellazzi M, Cepok S, Comabella M: Neurofilament ELISA validation. J Immunol Methods 2010, 352:23-31.
  • [29]Yusuff KB, Tayo F: Frequency, types and severity of medication use-related problems among medical outpatients in Nigeria. Int J Clin Pharm 2011, 33:558-564.
Article metrics
Downloads: 15; Views: 46