Measures of Agreement with Multiple Raters: Fréchet Variances and Inference.

IF 2.9 · CAS Tier 2 (Psychology) · JCR Q1, Mathematics, Interdisciplinary Applications
Psychometrika · Pub Date: 2024-06-01 · Epub Date: 2024-01-08 · DOI: 10.1007/s11336-023-09945-2
Jonas Moss
{"title":"Measures of Agreement with Multiple Raters: Fréchet Variances and Inference.","authors":"Jonas Moss","doi":"10.1007/s11336-023-09945-2","DOIUrl":null,"url":null,"abstract":"<p><p>Most measures of agreement are chance-corrected. They differ in three dimensions: their definition of chance agreement, their choice of disagreement function, and how they handle multiple raters. Chance agreement is usually defined in a pairwise manner, following either Cohen's kappa or Fleiss's kappa. The disagreement function is usually a nominal, quadratic, or absolute value function. But how to handle multiple raters is contentious, with the main contenders being Fleiss's kappa, Conger's kappa, and Hubert's kappa, the variant of Fleiss's kappa where agreement is said to occur only if every rater agrees. More generally, multi-rater agreement coefficients can be defined in a g-wise way, where the disagreement weighting function uses g raters instead of two. This paper contains two main contributions. (a) We propose using Fréchet variances to handle the case of multiple raters. The Fréchet variances are intuitive disagreement measures and turn out to generalize the nominal, quadratic, and absolute value functions to the case of more than two raters. (b) We derive the limit theory of g-wise weighted agreement coefficients, with chance agreement of the Cohen-type or Fleiss-type, for the case where every item is rated by the same number of raters. Trying out three confidence interval constructions, we end up recommending calculating confidence intervals using the arcsine transform or the Fisher transform.</p>","PeriodicalId":54534,"journal":{"name":"Psychometrika","volume":" ","pages":"517-541"},"PeriodicalIF":2.9000,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11164747/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Psychometrika","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.1007/s11336-023-09945-2","RegionNum":2,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/8 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

Most measures of agreement are chance-corrected. They differ in three dimensions: their definition of chance agreement, their choice of disagreement function, and how they handle multiple raters. Chance agreement is usually defined in a pairwise manner, following either Cohen's kappa or Fleiss's kappa. The disagreement function is usually a nominal, quadratic, or absolute value function. But how to handle multiple raters is contentious, with the main contenders being Fleiss's kappa, Conger's kappa, and Hubert's kappa, the variant of Fleiss's kappa where agreement is said to occur only if every rater agrees. More generally, multi-rater agreement coefficients can be defined in a g-wise way, where the disagreement weighting function uses g raters instead of two. This paper contains two main contributions. (a) We propose using Fréchet variances to handle the case of multiple raters. The Fréchet variances are intuitive disagreement measures and turn out to generalize the nominal, quadratic, and absolute value functions to the case of more than two raters. (b) We derive the limit theory of g-wise weighted agreement coefficients, with chance agreement of the Cohen-type or Fleiss-type, for the case where every item is rated by the same number of raters. Trying out three confidence interval constructions, we end up recommending calculating confidence intervals using the arcsine transform or the Fisher transform.
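To make the construction described in the abstract concrete, here is a minimal Python sketch; it is not taken from the paper. It assumes quadratic disagreement, under which the Fréchet variance of a set of ratings reduces to their ordinary variance, and a Fleiss-type chance term computed from the ratings pooled across items and raters, so the coefficient is one minus the ratio of observed to chance disagreement. The confidence interval uses the Fisher (arctanh) transform, one of the two transforms recommended in the abstract, but with a bootstrap standard error standing in for the paper's asymptotic theory. All function names and the toy data are hypothetical.

```python
import numpy as np

def frechet_variance(x):
    # Fréchet variance under the squared-Euclidean metric: the minimizer of
    # mean squared distance is the mean, so this is the ordinary
    # (population) variance of the ratings.
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 2)

def agreement(ratings):
    # Chance-corrected coefficient 1 - D_obs / D_chance, where D_obs is the
    # mean within-item Fréchet variance (the g-wise disagreement) and
    # D_chance is the Fréchet variance of all ratings pooled across items
    # and raters (a Fleiss-type chance term). `ratings` is an
    # (items x raters) array; every item has the same number of raters.
    ratings = np.asarray(ratings, dtype=float)
    d_obs = np.mean([frechet_variance(row) for row in ratings])
    d_chance = frechet_variance(ratings.ravel())
    return 1.0 - d_obs / d_chance

def fisher_ci(ratings, n_boot=2000, level_z=1.96, seed=0):
    # Roughly 95% interval on the Fisher (arctanh) scale, back-transformed
    # with tanh. A bootstrap over items supplies the standard error here,
    # purely for illustration.
    rng = np.random.default_rng(seed)
    n_items = ratings.shape[0]
    z_boot = np.array([
        np.arctanh(agreement(ratings[rng.integers(0, n_items, n_items)]))
        for _ in range(n_boot)
    ])
    z_hat = np.arctanh(agreement(ratings))
    half = level_z * z_boot.std(ddof=1)
    return np.tanh(z_hat - half), np.tanh(z_hat + half)

# Five items rated by three raters on a 1-5 ordinal scale (toy data).
r = np.array([[1, 1, 2],
              [3, 3, 4],
              [5, 4, 5],
              [2, 2, 1],
              [4, 5, 4]])
print(agreement(r))
print(fisher_ci(r))
```

Swapping the metric inside `frechet_variance` (nominal or absolute value instead of quadratic) would give the other disagreement functions mentioned in the abstract, with the rest of the sketch unchanged.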


Source journal
Psychometrika (Mathematics, Interdisciplinary Applications)
CiteScore: 4.40
Self-citation rate: 10.00%
Articles per year: 72
Review time: >12 weeks
Journal description: The journal Psychometrika is devoted to the advancement of theory and methodology for behavioral data in psychology, education, and the social and behavioral sciences generally. Its coverage is offered in two sections: Theory and Methods (T&M), and Application Reviews and Case Studies (ARCS). T&M articles present original research and reviews on the development of quantitative models, statistical methods, and mathematical techniques for evaluating data from psychology, the social and behavioral sciences, and related fields. Application Reviews can be integrative, drawing together disparate methodologies for applications, or comparative and evaluative, discussing advantages and disadvantages of one or more methodologies in applications. Case Studies highlight methodology that deepens understanding of substantive phenomena through more informative data analysis, or more elegant data description.