Evaluating the Consistency and Reliability of Attribution Methods in Automated Short Answer Grading (ASAG) Systems: Toward an Explainable Scoring System
{"title":"Evaluating the Consistency and Reliability of Attribution Methods in Automated Short Answer Grading (ASAG) Systems: Toward an Explainable Scoring System","authors":"Wallace N. Pinto Jr, Jinnie Shin","doi":"10.1111/jedm.12438","DOIUrl":null,"url":null,"abstract":"<p>In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates the use of attribution scores in ASAG systems, focusing on their consistency in reflecting model decisions. Specifically, we examined how attribution scores generated by different methods—namely Local Interpretable Model-agnostic Explanations (LIME), Integrated Gradients (IG), Hierarchical Explanation via Divisive Generation (HEDGE), and Leave-One-Out (LOO)—compare in their consistency and ability to illustrate the decision-making processes of transformer-based scoring systems trained on a publicly available response dataset. Additionally, we analyzed how attribution scores varied across different scoring categories in a polytomously scored response dataset and across two transformer-based scoring model architectures: Bidirectional Encoder Representations from Transformers (BERT) and Decoding-enhanced BERT with Disentangled Attention (DeBERTa-v2). Our findings highlight the challenges in evaluating explainability metrics, with important implications for both high-stakes and formative assessment contexts. This study contributes to the development of more reliable and transparent ASAG systems.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"248-281"},"PeriodicalIF":1.4000,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Educational Measurement","FirstCategoryId":"102","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/jedm.12438","RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"PSYCHOLOGY, APPLIED","Score":null,"Total":0}
Abstract
In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates the use of attribution scores in ASAG systems, focusing on their consistency in reflecting model decisions. Specifically, we examined how attribution scores generated by different methods—namely Local Interpretable Model-agnostic Explanations (LIME), Integrated Gradients (IG), Hierarchical Explanation via Divisive Generation (HEDGE), and Leave-One-Out (LOO)—compare in their consistency and ability to illustrate the decision-making processes of transformer-based scoring systems trained on a publicly available response dataset. Additionally, we analyzed how attribution scores varied across different scoring categories in a polytomously scored response dataset and across two transformer-based scoring model architectures: Bidirectional Encoder Representations from Transformers (BERT) and Decoding-enhanced BERT with Disentangled Attention (DeBERTa-v2). Our findings highlight the challenges in evaluating explainability metrics, with important implications for both high-stakes and formative assessment contexts. This study contributes to the development of more reliable and transparent ASAG systems.
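To make the comparison concrete, the sketch below shows one way token-level attribution scores might be computed with Integrated Gradients, one of the four methods named in the abstract, for a transformer-based scorer. This is an illustrative example, not the authors' implementation: the checkpoint, the number of score categories, and the sample response are hypothetical, and it assumes the Captum and Hugging Face Transformers libraries.

```python
# Illustrative sketch (not the study's code): token-level Integrated Gradients
# attributions for a BERT-based short-answer scorer, via Captum.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from captum.attr import LayerIntegratedGradients

MODEL_NAME = "bert-base-uncased"  # assumed stand-in for a scorer fine-tuned on polytomous labels
NUM_SCORE_CATEGORIES = 3          # hypothetical number of scoring categories

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_SCORE_CATEGORIES
)
model.eval()

def score_logits(input_ids, attention_mask):
    # Forward pass returning logits over the scoring categories.
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

response = "The plant grows toward the light because of auxin distribution."  # hypothetical response
enc = tokenizer(response, return_tensors="pt")

# Baseline input for IG: all-[PAD] sequence of the same length.
baseline_ids = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

# Token IDs are discrete, so attribute at the embedding layer.
lig = LayerIntegratedGradients(score_logits, model.bert.embeddings)
pred_class = score_logits(enc["input_ids"], enc["attention_mask"]).argmax(dim=-1).item()
attributions = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baseline_ids,
    additional_forward_args=(enc["attention_mask"],),
    target=pred_class,
)

# Collapse the embedding dimension to one attribution score per token.
token_scores = attributions.sum(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"].squeeze(0))
for tok, score in zip(tokens, token_scores):
    print(f"{tok:>12s}  {score.item():+.4f}")
```

Comparable per-token scores from LIME, HEDGE, or Leave-One-Out could then be correlated with these IG attributions to probe the kind of cross-method consistency the study examines.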
Journal Description
The Journal of Educational Measurement (JEM) publishes original measurement research, provides reviews of measurement publications, and reports on innovative measurement applications. The topics addressed will interest those concerned with the practice of measurement in field settings as well as measurement theorists. In addition to presenting new contributions to measurement theory and practice, JEM serves as a vehicle for improving educational measurement applications in a variety of settings.