Automatic Essay Scoring Systems Are Both Overstable and Oversensitive: Explaining Why and Proposing Defenses
Yaman Kumar Singla, Swapnil Parekh, Somesh Singh, J. Li, R. Shah, Changyou Chen
Dialogue and Discourse, vol. 4, no. 1, pp. 1-33
Published: 2021-09-24
DOI: 10.5210/dad.2023.101 (https://doi.org/10.5210/dad.2023.101)
Citations: 6
Abstract
Deep-learning based Automatic Essay Scoring (AES) systems are being actively used in various high-stakes applications in education and testing. However, little research has been devoted to understanding and interpreting the black-box nature of deep-learning-based scoring algorithms. While previous studies indicate that scoring models can be easily fooled, in this paper we explore the reason behind their surprising adversarial brittleness. We utilize recent advances in interpretability to determine the extent to which features such as coherence, content, vocabulary, and relevance matter to automated scoring mechanisms. We use this to investigate the oversensitivity (i.e., a large change in output score from a small change in input essay content) and overstability (i.e., little change in output scores despite large changes in input essay content) of AES. Our results indicate that autoscoring models, despite being trained end-to-end with rich contextual embeddings such as BERT, behave like bag-of-words models. A few words determine the essay score without requiring any context, making the models largely overstable. This is in stark contrast to recent probing studies on pre-trained representation learning models, which show that they encode rich linguistic features such as part-of-speech and morphology. Further, we find that the models have learnt dataset biases, making them oversensitive. The presence of a few words that co-occur strongly with a certain score class makes the model associate the essay sample with that score. This causes score changes in ∼95% of samples with the addition of only a few words. To deal with these issues, we propose detection-based protection models that can detect oversensitivity and samples causing overstability with high accuracy. We find that our proposed models are able to detect unusual attribution patterns and flag adversarial samples successfully.
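The oversensitivity probe the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's method: `score_essay` is a toy bag-of-words scorer standing in for a trained AES model, and the trigger words are hypothetical examples of tokens that co-occur strongly with a high score class.

```python
# Toy sketch of an oversensitivity probe: append a few words that co-occur
# strongly with a high score class, then check how far the predicted score
# moves. The scorer below is a deliberately simple bag-of-words stand-in
# for a real AES model (hypothetical, purely illustrative).

HIGH_SCORE_TRIGGERS = {"consequently", "furthermore", "notwithstanding"}

def score_essay(text: str) -> float:
    """Toy scorer: base score of 2.0 plus 1.0 per trigger word, capped at 6."""
    words = text.lower().split()
    bonus = sum(1.0 for w in words if w in HIGH_SCORE_TRIGGERS)
    return min(6.0, 2.0 + bonus)  # scores clipped to a 0-6 rubric

def oversensitivity_probe(essay: str, triggers, threshold: float = 1.0) -> bool:
    """Return True if appending a few trigger words shifts the score by
    more than `threshold` points -- evidence of oversensitivity."""
    original = score_essay(essay)
    perturbed = score_essay(essay + " " + " ".join(triggers))
    return abs(perturbed - original) > threshold

essay = "The students wrote about their summer holidays."
print(oversensitivity_probe(essay, HIGH_SCORE_TRIGGERS))  # True for this toy scorer
```

A bag-of-words scorer like this one is exactly the failure mode the paper reports: the score moves because of the presence of a few tokens, with no regard for context.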
Journal description:
D&D seeks previously unpublished, high-quality articles on the analysis of discourse and dialogue, including:
- experimental and/or theoretical studies related to the construction, representation, and maintenance of (linguistic) context
- linguistic analysis of phenomena characteristic of discourse and/or dialogue (including, but not limited to: reference and anaphora, presupposition and accommodation, topicality and salience, implicature, discourse structure and rhetorical relations, discourse markers and particles, the semantics and pragmatics of dialogue acts, questions, imperatives, non-sentential utterances, intonation, and meta-communicative phenomena such as repair and grounding)
- experimental and/or theoretical studies of agents' information states and their dynamics in conversational interaction
- new analytical frameworks that advance theoretical studies of discourse and dialogue
- research on systems performing coreference resolution, discourse structure parsing, event and temporal structure, and reference resolution in multimodal communication
- experimental and/or theoretical results yielding new insight into non-linguistic interaction in communication
- work on natural language understanding (including spoken language understanding), dialogue management, reasoning, and natural language generation (including text-to-speech) in dialogue systems
- work related to the design and engineering of dialogue systems (including, but not limited to: evaluation, usability design and testing, rapid application deployment, embodied agents, affect detection, mixed-initiative, adaptation, and user modeling)
- extremely well-written surveys of existing work

Highest priority is given to research reports written specifically for a multidisciplinary audience: researchers in discourse and dialogue and associated fields, including computer scientists, linguists, psychologists, philosophers, roboticists, and sociologists.