{"title":"Using ChatGPT to score essays and short-form constructed responses","authors":"Mark D. Shermis","doi":"10.1016/j.asw.2025.100988","DOIUrl":null,"url":null,"abstract":"<div><div>This study evaluates the effectiveness of ChatGPT-4o in scoring essays and short-form constructed responses compared to human raters and traditional machine learning models. Using data from the Automated Student Assessment Prize (ASAP), ChatGPT’s performance was assessed across multiple predictive models, including linear regression, random forest, gradient boost, and XGBoost. Results indicate that while ChatGPT’s gradient boost model achieved quadratic weighted kappa (QWK) scores close to human raters for some datasets, overall performance remained inconsistent, particularly for short-form responses. The study highlights key challenges, including variability in scoring accuracy, potential biases, and limitations in aligning ChatGPT’s predictions with human scoring standards. While ChatGPT demonstrated efficiency and scalability, its leniency and variability suggest that it should not yet replace human raters in high-stakes assessments. Instead, a hybrid approach combining AI with empirical scoring models may improve reliability. Future research should focus on refining AI-driven scoring models through enhanced fine-tuning, bias mitigation, and validation with broader datasets. Ethical considerations, including fairness in automated scoring and data security, must also be addressed. This study concludes that ChatGPT holds promise as a supplementary tool in educational assessment but requires further development to ensure validity and fairness.</div></div>","PeriodicalId":46865,"journal":{"name":"Assessing Writing","volume":"66 ","pages":"Article 100988"},"PeriodicalIF":5.5000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Assessing Writing","FirstCategoryId":"98","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1075293525000753","RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Citations: 0
Abstract
This study evaluates the effectiveness of ChatGPT-4o in scoring essays and short-form constructed responses compared to human raters and traditional machine learning models. Using data from the Automated Student Assessment Prize (ASAP), ChatGPT’s performance was assessed across multiple predictive models, including linear regression, random forest, gradient boosting, and XGBoost. Results indicate that while ChatGPT’s gradient boosting model achieved quadratic weighted kappa (QWK) scores close to those of human raters for some datasets, overall performance remained inconsistent, particularly for short-form responses. The study highlights key challenges, including variability in scoring accuracy, potential biases, and limitations in aligning ChatGPT’s predictions with human scoring standards. While ChatGPT demonstrated efficiency and scalability, its leniency and variability suggest that it should not yet replace human raters in high-stakes assessments. Instead, a hybrid approach that combines AI with empirical scoring models may improve reliability. Future research should focus on refining AI-driven scoring models through enhanced fine-tuning, bias mitigation, and validation with broader datasets. Ethical considerations, including fairness in automated scoring and data security, must also be addressed. The study concludes that ChatGPT holds promise as a supplementary tool in educational assessment but requires further development to ensure validity and fairness.
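As context for the reported metric, the sketch below shows how quadratic weighted kappa (QWK) agreement between machine-generated and human scores can be computed with scikit-learn, and how a gradient boosting regressor might serve as the kind of hybrid mapping from LLM output to the human score scale that the abstract describes. The synthetic scores, the 0-3 rubric range, and the leniency offset are illustrative assumptions, not the study's ASAP data or pipeline.

# Minimal QWK/gradient-boosting sketch; data and score range are assumed, not from the study.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder integer scores standing in for human ratings on a 0-3 rubric.
human_scores = rng.integers(0, 4, size=500)
# Simulated LLM scores with a small positive (lenient) offset and added noise.
gpt_raw = np.clip(human_scores + rng.normal(0.3, 0.8, 500), 0, 3)

# Direct agreement between rounded LLM scores and human scores.
qwk_direct = cohen_kappa_score(human_scores,
                               np.rint(gpt_raw).astype(int),
                               weights="quadratic")
print(f"QWK, LLM vs. human: {qwk_direct:.3f}")

# Hybrid approach: treat the raw LLM score as a feature, fit a gradient
# boosting regressor to human scores, and evaluate QWK on held-out data.
X = gpt_raw.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, human_scores,
                                           test_size=0.25, random_state=0)
gb = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred = np.rint(gb.predict(X_te)).clip(0, 3).astype(int)
qwk_gb = cohen_kappa_score(y_te, pred, weights="quadratic")
print(f"QWK, gradient-boosted predictions vs. human: {qwk_gb:.3f}")

In a real evaluation the feature vector would include many more predictors than a single LLM score, and QWK would be computed per ASAP prompt against each human rater.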
Journal description:
Assessing Writing is a refereed international journal providing a forum for ideas, research, and practice on the assessment of written language. Assessing Writing publishes articles, book reviews, conference reports, and academic exchanges concerning writing assessments of all kinds, including traditional (direct and standardised) testing of writing, alternative performance assessments (such as portfolios), workplace sampling, and classroom assessment. The journal focuses on all stages of the writing assessment process, including needs evaluation, assessment creation, implementation and validation, and test development.