Using Keystroke Behavior Patterns to Detect Nonauthentic Texts in Writing Assessments: Evaluating the Fairness of Predictive Models

IF 1.6 4区心理学 Q3 PSYCHOLOGY, APPLIED

Journal of Educational Measurement Pub Date : 2024-10-18 DOI:10.1111/jedm.12416

Yang Jiang, Mo Zhang, Jiangang Hao, Paul Deane, Chen Li

{"title":"Using Keystroke Behavior Patterns to Detect Nonauthentic Texts in Writing Assessments: Evaluating the Fairness of Predictive Models","authors":"Yang Jiang, Mo Zhang, Jiangang Hao, Paul Deane, Chen Li","doi":"10.1111/jedm.12416","DOIUrl":null,"url":null,"abstract":"<p>The emergence of sophisticated AI tools such as ChatGPT, coupled with the transition to remote delivery of educational assessments in the COVID-19 era, has led to increasing concerns about academic integrity and test security. Using AI tools, test takers can produce high-quality texts effortlessly and use them to game assessments. It is thus critical to detect these nonauthentic texts to ensure test integrity. In this study, we leveraged keystroke logs—recordings of every keypress—to build machine learning (ML) detectors of nonauthentic texts in a large-scale writing assessment. We focused on investigating the fairness of the detectors across demographic subgroups to ensure that nongenuine writing can be predicted equally well across subgroups. Results indicated that keystroke dynamics were effective in identifying nonauthentic texts. While the ML models were slightly more likely to misclassify the original responses submitted by male test takers as consisting of nonauthentic texts than those submitted by females, the effect sizes were negligible. Furthermore, balancing demographic distributions and class labels did not consistently mitigate detector bias across predictive models. Findings of this study not only provide implications for using behavioral data to address test security issues, but also highlight the importance of evaluating the fairness of predictive models in educational contexts.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 4","pages":"571-594"},"PeriodicalIF":1.6000,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Educational Measurement","FirstCategoryId":"102","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/jedm.12416","RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"PSYCHOLOGY, APPLIED","Score":null,"Total":0}

引用次数: 0

Abstract

The emergence of sophisticated AI tools such as ChatGPT, coupled with the transition to remote delivery of educational assessments in the COVID-19 era, has led to increasing concerns about academic integrity and test security. Using AI tools, test takers can produce high-quality texts effortlessly and use them to game assessments. It is thus critical to detect these nonauthentic texts to ensure test integrity. In this study, we leveraged keystroke logs—recordings of every keypress—to build machine learning (ML) detectors of nonauthentic texts in a large-scale writing assessment. We focused on investigating the fairness of the detectors across demographic subgroups to ensure that nongenuine writing can be predicted equally well across subgroups. Results indicated that keystroke dynamics were effective in identifying nonauthentic texts. While the ML models were slightly more likely to misclassify the original responses submitted by male test takers as consisting of nonauthentic texts than those submitted by females, the effect sizes were negligible. Furthermore, balancing demographic distributions and class labels did not consistently mitigate detector bias across predictive models. Findings of this study not only provide implications for using behavioral data to address test security issues, but also highlight the importance of evaluating the fairness of predictive models in educational contexts.

查看原文本刊更多论文

使用击键行为模式检测写作评估中的非真实文本：评估预测模型的公平性

ChatGPT等复杂人工智能工具的出现，加上新冠肺炎时代教育评估向远程交付的过渡，导致人们越来越关注学术诚信和考试安全。使用人工智能工具，考生可以毫不费力地编写高质量的文本，并将其用于游戏评估。因此，检测这些非真实文本以确保测试的完整性是至关重要的。在这项研究中，我们利用击键日志（每次击键的记录）在大规模写作评估中构建非真实文本的机器学习（ML）检测器。我们专注于调查探测器在人口统计子组中的公平性，以确保在子组中可以同样很好地预测非真实写作。结果表明，击键动力学在识别非真实文本方面是有效的。虽然ML模型更有可能将男性考生提交的原始回答错误分类为由不真实的文本组成，而不是女性考生提交的原始回答，但效应大小可以忽略不计。此外，平衡人口分布和类别标签并不能始终如一地减轻预测模型中的检测器偏差。这项研究的发现不仅为使用行为数据来解决考试安全问题提供了启示，而且还强调了在教育背景下评估预测模型公平性的重要性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Journal of Educational Measurement Multiple-

CiteScore

2.30

自引率

7.70%

发文量

期刊介绍： The Journal of Educational Measurement (JEM) publishes original measurement research, provides reviews of measurement publications, and reports on innovative measurement applications. The topics addressed will interest those concerned with the practice of measurement in field settings, as well as be of interest to measurement theorists. In addition to presenting new contributions to measurement theory and practice, JEM also serves as a vehicle for improving educational measurement applications in a variety of settings.