Evaluating the Consistency and Reliability of Attribution Methods in Automated Short Answer Grading (ASAG) Systems: Toward an Explainable Scoring System

IF 1.4 · CAS Tier 4 (Psychology) · JCR Q3, PSYCHOLOGY, APPLIED
Wallace N. Pinto Jr, Jinnie Shin
{"title":"Evaluating the Consistency and Reliability of Attribution Methods in Automated Short Answer Grading (ASAG) Systems: Toward an Explainable Scoring System","authors":"Wallace N. Pinto Jr,&nbsp;Jinnie Shin","doi":"10.1111/jedm.12438","DOIUrl":null,"url":null,"abstract":"<p>In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates the use of attribution scores in ASAG systems, focusing on their consistency in reflecting model decisions. Specifically, we examined how attribution scores generated by different methods—namely Local Interpretable Model-agnostic Explanations (LIME), Integrated Gradients (IG), Hierarchical Explanation via Divisive Generation (HEDGE), and Leave-One-Out (LOO)—compare in their consistency and ability to illustrate the decision-making processes of transformer-based scoring systems trained on a publicly available response dataset. Additionally, we analyzed how attribution scores varied across different scoring categories in a polytomously scored response dataset and across two transformer-based scoring model architectures: Bidirectional Encoder Representations from Transformers (BERT) and Decoding-enhanced BERT with Disentangled Attention (DeBERTa-v2). Our findings highlight the challenges in evaluating explainability metrics, with important implications for both high-stakes and formative assessment contexts. This study contributes to the development of more reliable and transparent ASAG systems.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"248-281"},"PeriodicalIF":1.4000,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Educational Measurement","FirstCategoryId":"102","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/jedm.12438","RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"PSYCHOLOGY, APPLIED","Score":null,"Total":0}
Citations: 0

Abstract

In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates the use of attribution scores in ASAG systems, focusing on their consistency in reflecting model decisions. Specifically, we examined how attribution scores generated by different methods—namely Local Interpretable Model-agnostic Explanations (LIME), Integrated Gradients (IG), Hierarchical Explanation via Divisive Generation (HEDGE), and Leave-One-Out (LOO)—compare in their consistency and ability to illustrate the decision-making processes of transformer-based scoring systems trained on a publicly available response dataset. Additionally, we analyzed how attribution scores varied across different scoring categories in a polytomously scored response dataset and across two transformer-based scoring model architectures: Bidirectional Encoder Representations from Transformers (BERT) and Decoding-enhanced BERT with Disentangled Attention (DeBERTa-v2). Our findings highlight the challenges in evaluating explainability metrics, with important implications for both high-stakes and formative assessment contexts. This study contributes to the development of more reliable and transparent ASAG systems.
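Of the four methods compared, Integrated Gradients has the most compact definition: the attribution for input feature x_i is the gradient of the model output F integrated along a straight path from a baseline x' to the input x,

    IG_i(x) = (x_i - x'_i) * ∫_0^1 ∂F(x' + α(x - x')) / ∂x_i dα.

The sketch below shows one plausible way to compute such token-level IG attributions for a transformer-based scorer with Captum, plus a simple rank-correlation check of agreement between two attribution vectors. It is a minimal illustration under stated assumptions, not the paper's implementation: the checkpoint, label count, example response, and the Spearman-based consistency measure are all hypothetical stand-ins.

```python
# Illustrative sketch only: token-level Integrated Gradients for a
# transformer-based short-answer scorer via Captum, plus a simple
# rank-correlation check of agreement between two attribution vectors.
# The checkpoint, num_labels, and example response are hypothetical.
import torch
from scipy.stats import spearmanr
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from captum.attr import LayerIntegratedGradients

MODEL = "bert-base-uncased"  # stand-in; the paper fine-tunes BERT/DeBERTa-v2 scorers
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)
model.eval()

def forward_logits(input_ids, attention_mask):
    """Forward pass returning class logits, as Captum expects."""
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

response = "Plants make their own food through photosynthesis."  # hypothetical answer
enc = tokenizer(response, return_tensors="pt")
# Baseline: the same sequence with every token replaced by [PAD].
baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

# Token IDs are discrete, so IG is computed over the embedding layer
# (for a DeBERTa scorer this would be model.deberta.embeddings).
lig = LayerIntegratedGradients(forward_logits, model.bert.embeddings)
pred = forward_logits(enc["input_ids"], enc["attention_mask"]).argmax(dim=-1).item()

attrs = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baseline,
    additional_forward_args=(enc["attention_mask"],),
    target=pred,   # attribute toward the predicted score category
    n_steps=50,    # integration resolution along the baseline-to-input path
)
ig_scores = attrs.sum(dim=-1).squeeze(0)  # collapse embedding dims: one scalar per token

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"].squeeze(0))
for tok, score in zip(tokens, ig_scores.tolist()):
    print(f"{tok:>15s} {score:+.4f}")

def rank_consistency(scores_a, scores_b):
    """Spearman rank correlation between two methods' token attributions;
    one simple consistency measure, not necessarily the paper's metric."""
    return spearmanr(scores_a, scores_b).correlation
```

Given a second vector of per-token scores over the same tokens (e.g., from LIME or LOO), rank_consistency(ig_scores.tolist(), other_scores) yields one rough per-response agreement measure between methods; the paper's actual consistency analyses may use different statistics.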

Source Journal
CiteScore: 2.30
Self-citation rate: 7.70%
Articles per year: 46
Journal description: The Journal of Educational Measurement (JEM) publishes original measurement research, provides reviews of measurement publications, and reports on innovative measurement applications. The topics addressed will interest those concerned with the practice of measurement in field settings as well as measurement theorists. In addition to presenting new contributions to measurement theory and practice, JEM serves as a vehicle for improving educational measurement applications in a variety of settings.