Daniel Bengs, Ulf Brefeld, Ulf Kroehne, Fabian Zehner
{"title":"Joint Item Response Models for Manual and Automatic Scores on Open-Ended Test Items.","authors":"Daniel Bengs, Ulf Brefeld, Ulf Kroehne, Fabian Zehner","doi":"10.1017/psy.2025.10018","DOIUrl":null,"url":null,"abstract":"<p><p>Test items using open-ended response formats can increase an instrument's construct validity. However, traditionally, their application in educational testing requires human coders to score the responses. Manual scoring not only increases operational costs but also prohibits the use of evidence from open-ended items to inform routing decisions in adaptive designs. Using machine learning and natural language processing, automatic scoring provides classifiers that can instantly assign scores to text responses. Although optimized for agreement with manual scores, automatic scoring is not perfectly accurate and introduces an additional source of error into the response process, leading to a misspecification of the measurement model used with the manual score. We propose two joint models for manual and automatic scores of automatically scored open-ended items. Our models extend a given model from Item Response Theory for the manual scores by a component for the automatic scores, accounting for classification errors. The models were evaluated using data from the Programme for International Student Assessment (2012) and simulated data, demonstrating their capacity to mitigate the impact of classification errors on ability estimation compared to a baseline that disregards classification errors.</p>","PeriodicalId":54534,"journal":{"name":"Psychometrika","volume":" ","pages":"1-22"},"PeriodicalIF":3.1000,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Psychometrika","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.1017/psy.2025.10018","RegionNum":2,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0
Abstract
Test items using open-ended response formats can increase an instrument's construct validity. However, traditionally, their application in educational testing requires human coders to score the responses. Manual scoring not only increases operational costs but also prohibits the use of evidence from open-ended items to inform routing decisions in adaptive designs. Using machine learning and natural language processing, automatic scoring provides classifiers that can instantly assign scores to text responses. Although optimized for agreement with manual scores, automatic scoring is not perfectly accurate and introduces an additional source of error into the response process, leading to a misspecification of the measurement model used with the manual score. We propose two joint models for manual and automatic scores of automatically scored open-ended items. Our models extend a given Item Response Theory model for the manual scores with a component for the automatic scores that accounts for classification errors. The models were evaluated using data from the Programme for International Student Assessment (2012) and simulated data, demonstrating their capacity to mitigate the impact of classification errors on ability estimation compared to a baseline that disregards classification errors.
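For illustration only, one plausible way to formalize such a joint model is sketched below; it assumes dichotomous items, a 2PL base model, and classifier errors that depend only on the manual score, and it is not necessarily the authors' exact specification. The manual score X_ij and automatic score Y_ij of person i on item j are modeled jointly as

P(X_{ij}=x,\, Y_{ij}=y \mid \theta_i) \;=\; P(X_{ij}=x \mid \theta_i)\, P(Y_{ij}=y \mid X_{ij}=x),

with the IRT component

P(X_{ij}=1 \mid \theta_i) \;=\; \frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}},

and the misclassification component P(Y_{ij}=y \mid X_{ij}=x) given by the classifier's confusion probabilities (e.g., estimated from a validation sample with both scores). Treating Y_ij as if it were X_ij amounts to dropping the second factor, which misspecifies the measurement model whenever the confusion probabilities deviate from perfect agreement.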
Journal overview:
The journal Psychometrika is devoted to the advancement of theory and methodology for behavioral data in psychology, education, and the social and behavioral sciences generally. Its coverage is offered in two sections: Theory and Methods (T&M), and Application Reviews and Case Studies (ARCS). T&M articles present original research and reviews on the development of quantitative models, statistical methods, and mathematical techniques for evaluating data from psychology, the social and behavioral sciences, and related fields. Application Reviews can be integrative, drawing together disparate methodologies for applications, or comparative and evaluative, discussing advantages and disadvantages of one or more methodologies in applications. Case Studies highlight methodology that deepens understanding of substantive phenomena through more informative data analysis, or more elegant data description.