To Score or Not to Score: Factors Influencing Performance and Feasibility of Automatic Content Scoring of Text Responses

IF 2.7 · CAS Tier 4 (Education) · JCR Q1 (Education & Educational Research)
Torsten Zesch, Andrea Horbach, Fabian Zehner
{"title":"To Score or Not to Score: Factors Influencing Performance and Feasibility of Automatic Content Scoring of Text Responses","authors":"Torsten Zesch,&nbsp;Andrea Horbach,&nbsp;Fabian Zehner","doi":"10.1111/emip.12544","DOIUrl":null,"url":null,"abstract":"<p>In this article, we systematize the factors influencing performance and feasibility of automatic content scoring methods for short text responses. We argue that performance (i.e., how well an automatic system agrees with human judgments) mainly depends on the linguistic <i>variance</i> seen in the responses and that this variance is indirectly influenced by other factors such as target population or input modality. Extending previous work, we distinguish <i>conceptual</i>, <i>realization</i>, and <i>nonconformity variance</i>, which are differentially impacted by the various factors. While conceptual variance relates to different concepts embedded in the text responses, realization variance refers to their diverse manifestation through natural language. Nonconformity variance is added by aberrant response behavior. Furthermore, besides its performance, the feasibility of using an automatic scoring system depends on external factors, such as ethical or computational constraints, which influence whether a system with a given performance is accepted by stakeholders. Our work provides (i) a framework for assessment practitioners to decide a priori whether automatic content scoring can be successfully applied in a given setup as well as (ii) new empirical findings and the integration of empirical findings from the literature on factors that influence automatic systems' performance.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"42 1","pages":"44-58"},"PeriodicalIF":2.7000,"publicationDate":"2023-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12544","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational Measurement-Issues and Practice","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/emip.12544","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
引用次数: 0

Abstract

In this article, we systematize the factors influencing performance and feasibility of automatic content scoring methods for short text responses. We argue that performance (i.e., how well an automatic system agrees with human judgments) mainly depends on the linguistic variance seen in the responses and that this variance is indirectly influenced by other factors such as target population or input modality. Extending previous work, we distinguish conceptual, realization, and nonconformity variance, which are differentially impacted by the various factors. While conceptual variance relates to different concepts embedded in the text responses, realization variance refers to their diverse manifestation through natural language. Nonconformity variance is added by aberrant response behavior. Furthermore, besides its performance, the feasibility of using an automatic scoring system depends on external factors, such as ethical or computational constraints, which influence whether a system with a given performance is accepted by stakeholders. Our work provides (i) a framework for assessment practitioners to decide a priori whether automatic content scoring can be successfully applied in a given setup as well as (ii) new empirical findings and the integration of empirical findings from the literature on factors that influence automatic systems' performance.
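The abstract operationalizes performance as agreement between an automatic system and human judgments. A common agreement metric in the content-scoring literature is quadratically weighted kappa (QWK); the minimal sketch below computes it with scikit-learn's cohen_kappa_score. The score vectors are invented for illustration and do not come from the article.

```python
# Minimal sketch: quadratically weighted kappa (QWK) as a
# human-machine agreement measure for content scoring.
# The score vectors below are hypothetical examples.
from sklearn.metrics import cohen_kappa_score

human_scores = [0, 1, 2, 2, 1, 0, 2, 1]    # hypothetical human ratings
system_scores = [0, 1, 2, 1, 1, 0, 2, 2]   # hypothetical system predictions

qwk = cohen_kappa_score(human_scores, system_scores, weights="quadratic")
print(f"QWK: {qwk:.3f}")  # 1.0 = perfect agreement, 0 = chance level
```

On this reading, higher linguistic variance in the responses tends to depress such agreement scores, which is the link the article draws between variance and system performance.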

Source journal metrics: CiteScore 3.90 · Self-citation rate 15.00% · Annual output: 47 articles