Using Linkage Sets to Improve Connectedness in Rater Response Model Estimation

IF 1.6 | Region 4, Psychology | JCR Q3, PSYCHOLOGY, APPLIED

Journal of Educational Measurement | Pub Date: 2023-02-19 | DOI: 10.1111/jedm.12360

Jodi M. Casabianca, John R. Donoghue, Hyo Jeong Shin, Szu-Fu Chao, Ikkyu Choi

Volume 60, Issue 3, pages 428-454. Publication type: Journal Article.

Citations: 1

Abstract

Using item response theory to model rater effects offers an alternative to standard performance metrics for rater monitoring and diagnosis. To fit such models, the ratings data must be sufficiently connected for rater effects to be estimated. The rating designs popular in large-scale testing tend to leave a large proportion of data missing, which yields sparse matrices and estimation problems. In this article, we explore the impact of different types of connectedness, or linkage, brought about by using a linkage set, that is, a collection of responses scored by most or all raters. We also explore the impact of the properties and composition of the linkage set, the different degrees of connectedness produced by different rating designs, and the role of scores from automated scoring engines. For monitoring systems designed around the rater response version of the generalized partial credit model, the study results suggest using a linkage set, especially a large one composed of responses that represent the full score scale. Results also show that a double-human-scoring design provides more connectedness than a design with one human and an automated scoring engine. Furthermore, scores from automated scoring engines do not provide adequate connectedness. We discuss considerations for operational implementation and further study.
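To make the notion of connectedness concrete, the sketch below treats raters and responses as the two sides of a bipartite graph and uses a union-find structure to group raters that are joined by a chain of shared responses. It is only an illustration of connectedness in the graph sense; it is not the article's estimation procedure or its connectedness measure, and the function names (`connected_components`, `rater_groups`) and the toy data are assumptions made for this example.

```python
from collections import defaultdict


def connected_components(ratings):
    """Group raters and responses into connected components.

    `ratings` is an iterable of (rater_id, response_id) pairs, one per
    observed score. Each pair is an edge in a bipartite graph.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Tag each node with its kind so rater and response ids cannot collide.
    for rater, response in ratings:
        union(("rater", rater), ("response", response))

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def rater_groups(ratings):
    """Return the sorted groups of raters that are linked to one another."""
    components = connected_components(ratings)
    return sorted(
        sorted(name for kind, name in component if kind == "rater")
        for component in components
    )


# Toy single-scoring design: each response is scored by exactly one rater,
# so no two raters share a response.
single = [("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5), ("C", 6)]
print(rater_groups(single))  # [['A'], ['B'], ['C']]

# Hypothetical linkage set: two extra responses scored by every rater.
linkage = [(rater, f"L{k}") for rater in "ABC" for k in (1, 2)]
print(rater_groups(single + linkage))  # [['A', 'B', 'C']]
```

In the toy single-scoring design every rater forms an isolated group, so rater effects cannot be placed on a common scale; adding the two linkage responses joins all raters into one group. A single connected group is only a minimal condition: the abstract's findings about linkage set size and score-scale coverage concern how well connected the data are, which this check does not capture.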

Source journal

Journal of Educational Measurement Multiple-

CiteScore: 2.30

Self-citation rate: 7.70%

About the journal: The Journal of Educational Measurement (JEM) publishes original measurement research, provides reviews of measurement publications, and reports on innovative measurement applications. The topics addressed will interest those concerned with the practice of measurement in field settings as well as measurement theorists. In addition to presenting new contributions to measurement theory and practice, JEM also serves as a vehicle for improving educational measurement applications in a variety of settings.