{"title":"Using ChatGPT to score essays and short-form constructed responses","authors":"Mark D. Shermis","doi":"10.1016/j.asw.2025.100988","DOIUrl":null,"url":null,"abstract":"<div><div>This study evaluates the effectiveness of ChatGPT-4o in scoring essays and short-form constructed responses compared to human raters and traditional machine learning models. Using data from the Automated Student Assessment Prize (ASAP), ChatGPT’s performance was assessed across multiple predictive models, including linear regression, random forest, gradient boost, and XGBoost. Results indicate that while ChatGPT’s gradient boost model achieved quadratic weighted kappa (QWK) scores close to human raters for some datasets, overall performance remained inconsistent, particularly for short-form responses. The study highlights key challenges, including variability in scoring accuracy, potential biases, and limitations in aligning ChatGPT’s predictions with human scoring standards. While ChatGPT demonstrated efficiency and scalability, its leniency and variability suggest that it should not yet replace human raters in high-stakes assessments. Instead, a hybrid approach combining AI with empirical scoring models may improve reliability. Future research should focus on refining AI-driven scoring models through enhanced fine-tuning, bias mitigation, and validation with broader datasets. Ethical considerations, including fairness in automated scoring and data security, must also be addressed. This study concludes that ChatGPT holds promise as a supplementary tool in educational assessment but requires further development to ensure validity and fairness.</div></div>","PeriodicalId":46865,"journal":{"name":"Assessing Writing","volume":"66 ","pages":"Article 100988"},"PeriodicalIF":5.5000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Assessing Writing","FirstCategoryId":"98","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1075293525000753","RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Citations: 0
Abstract
This study evaluates the effectiveness of ChatGPT-4o in scoring essays and short-form constructed responses compared to human raters and traditional machine learning models. Using data from the Automated Student Assessment Prize (ASAP), ChatGPT’s performance was assessed across multiple predictive models, including linear regression, random forest, gradient boosting, and XGBoost. Results indicate that while ChatGPT’s gradient boosting model achieved quadratic weighted kappa (QWK) scores close to those of human raters for some datasets, overall performance remained inconsistent, particularly for short-form responses. The study highlights key challenges, including variability in scoring accuracy, potential biases, and limitations in aligning ChatGPT’s predictions with human scoring standards. While ChatGPT demonstrated efficiency and scalability, its leniency and variability suggest that it should not yet replace human raters in high-stakes assessments. Instead, a hybrid approach that combines AI with empirical scoring models may improve reliability. Future research should focus on refining AI-driven scoring models through enhanced fine-tuning, bias mitigation, and validation with broader datasets. Ethical considerations, including fairness in automated scoring and data security, must also be addressed. The study concludes that ChatGPT holds promise as a supplementary tool in educational assessment but requires further development to ensure validity and fairness.
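As context for the reported metric, the sketch below shows how quadratic weighted kappa (QWK) agreement between machine-generated and human scores can be computed with scikit-learn, and how a gradient boosting regressor might serve as the kind of hybrid mapping from LLM output to the human score scale that the abstract describes. The synthetic scores, the 0-3 rubric range, and the leniency offset are illustrative assumptions, not the study's ASAP data or pipeline.

# Minimal QWK/gradient-boosting sketch; data and score range are assumed, not from the study.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder integer scores standing in for human ratings on a 0-3 rubric.
human_scores = rng.integers(0, 4, size=500)
# Simulated LLM scores with a small positive (lenient) offset and added noise.
gpt_raw = np.clip(human_scores + rng.normal(0.3, 0.8, 500), 0, 3)

# Direct agreement between rounded LLM scores and human scores.
qwk_direct = cohen_kappa_score(human_scores,
                               np.rint(gpt_raw).astype(int),
                               weights="quadratic")
print(f"QWK, LLM vs. human: {qwk_direct:.3f}")

# Hybrid approach: treat the raw LLM score as a feature, fit a gradient
# boosting regressor to human scores, and evaluate QWK on held-out data.
X = gpt_raw.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, human_scores,
                                           test_size=0.25, random_state=0)
gb = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred = np.rint(gb.predict(X_te)).clip(0, 3).astype(int)
qwk_gb = cohen_kappa_score(y_te, pred, weights="quadratic")
print(f"QWK, gradient-boosted predictions vs. human: {qwk_gb:.3f}")

In a real evaluation the feature vector would include many more predictors than a single LLM score, and QWK would be computed per ASAP prompt against each human rater.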
Journal description:
Assessing Writing is a refereed international journal providing a forum for ideas, research, and practice on the assessment of written language. Assessing Writing publishes articles, book reviews, conference reports, and academic exchanges concerning writing assessments of all kinds, including traditional (direct and standardised) testing of writing, alternative performance assessments (such as portfolios), workplace sampling, and classroom assessment. The journal focuses on all stages of the writing assessment process, including needs evaluation, assessment creation, implementation and validation, and test development.