{"title":"To Score or Not to Score: Factors Influencing Performance and Feasibility of Automatic Content Scoring of Text Responses","authors":"Torsten Zesch, Andrea Horbach, Fabian Zehner","doi":"10.1111/emip.12544","DOIUrl":null,"url":null,"abstract":"<p>In this article, we systematize the factors influencing performance and feasibility of automatic content scoring methods for short text responses. We argue that performance (i.e., how well an automatic system agrees with human judgments) mainly depends on the linguistic <i>variance</i> seen in the responses and that this variance is indirectly influenced by other factors such as target population or input modality. Extending previous work, we distinguish <i>conceptual</i>, <i>realization</i>, and <i>nonconformity variance</i>, which are differentially impacted by the various factors. While conceptual variance relates to different concepts embedded in the text responses, realization variance refers to their diverse manifestation through natural language. Nonconformity variance is added by aberrant response behavior. Furthermore, besides its performance, the feasibility of using an automatic scoring system depends on external factors, such as ethical or computational constraints, which influence whether a system with a given performance is accepted by stakeholders. Our work provides (i) a framework for assessment practitioners to decide a priori whether automatic content scoring can be successfully applied in a given setup as well as (ii) new empirical findings and the integration of empirical findings from the literature on factors that influence automatic systems' performance.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"42 1","pages":"44-58"},"PeriodicalIF":2.7000,"publicationDate":"2023-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12544","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational Measurement-Issues and Practice","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/emip.12544","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Abstract
In this article, we systematize the factors influencing the performance and feasibility of automatic content scoring methods for short text responses. We argue that performance (i.e., how well an automatic system agrees with human judgments) mainly depends on the linguistic variance seen in the responses and that this variance is indirectly influenced by other factors such as target population or input modality. Extending previous work, we distinguish conceptual, realization, and nonconformity variance, which are differentially impacted by the various factors. While conceptual variance relates to different concepts embedded in the text responses, realization variance refers to their diverse manifestation through natural language. Nonconformity variance is introduced by aberrant response behavior. Furthermore, beyond performance, the feasibility of using an automatic scoring system depends on external factors, such as ethical or computational constraints, which influence whether a system with a given performance is accepted by stakeholders. Our work provides (i) a framework for assessment practitioners to decide a priori whether automatic content scoring can be successfully applied in a given setup as well as (ii) new empirical findings and the integration of empirical findings from the literature on factors that influence automatic systems' performance.
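The abstract defines performance as the degree of agreement between an automatic system and human judgments. As a purely illustrative sketch (not taken from the article), the snippet below shows one common way such agreement is quantified: quadratically weighted kappa computed over hypothetical human and machine scores for the same set of responses. The metric choice and the example labels are assumptions for illustration only.

```python
# Illustrative sketch (assumption, not from the article): quantifying
# human-machine agreement for content scoring with quadratically
# weighted kappa, a widely used agreement statistic for ordinal scores.
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores assigned to the same short text responses
human_scores = [2, 1, 0, 2, 1, 2, 0, 1]    # ratings by a human scorer
machine_scores = [2, 1, 1, 2, 1, 2, 0, 0]  # ratings by an automatic system

# Quadratic weighting penalizes large score disagreements more heavily
qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
print(f"Quadratically weighted kappa (human vs. machine): {qwk:.2f}")
```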