Utilizing large language models for EFL essay grading: An examination of reliability and validity in rubric‐based assessments

Fatih Yavuz, Özgür Çelik, Gamze Yavaş Çelik
{"title":"Utilizing large language models for EFL essay grading: An examination of reliability and validity in rubric‐based assessments","authors":"Fatih Yavuz, Özgür Çelik, Gamze Yavaş Çelik","doi":"10.1111/bjet.13494","DOIUrl":null,"url":null,"abstract":"This study investigates the validity and reliability of generative large language models (LLMs), specifically ChatGPT and Google's Bard, in grading student essays in higher education based on an analytical grading rubric. A total of 15 experienced English as a foreign language (EFL) instructors and two LLMs were asked to evaluate three student essays of varying quality. The grading scale comprised five domains: grammar, content, organization, style & expression and mechanics. The results revealed that fine‐tuned ChatGPT model demonstrated a very high level of reliability with an intraclass correlation (ICC) score of 0.972, Default ChatGPT model exhibited an ICC score of 0.947 and Bard showed a substantial level of reliability with an ICC score of 0.919. Additionally, a significant overlap was observed in certain domains when comparing the grades assigned by LLMs and human raters. In conclusion, the findings suggest that while LLMs demonstrated a notable consistency and potential for grading competency, further fine‐tuning and adjustment are needed for a more nuanced understanding of non‐objective essay criteria. The study not only offers insights into the potential use of LLMs in grading student essays but also highlights the need for continued development and research.\nWhat is already known about this topic\n\nLarge language models (LLMs), such as OpenAI's ChatGPT and Google's Bard, are known for their ability to generate text that mimics human‐like conversation and writing.\nLLMs can perform various tasks, including essay grading.\nIntraclass correlation (ICC) is a statistical measure used to assess the reliability of ratings given by different raters (in this case, EFL instructors and LLMs).\nWhat this paper adds\n\nThe study makes a unique contribution by directly comparing the grading performance of expert EFL instructors with two LLMs—ChatGPT and Bard—using an analytical grading scale.\nIt provides robust empirical evidence showing high reliability of LLMs in grading essays, supported by high ICC scores.\nIt specifically highlights that the overall efficacy of LLMs extends to certain domains of essay grading.\nImplications for practice and/or policyThe findings open up potential new avenues for utilizing LLMs in academic settings, particularly for grading student essays, thereby possibly alleviating workload of educators.The paper's insistence on the need for further fine‐tuning of LLMs underlines the continual interplay between technological advancement and its practical applications.The results lay down a footprint for future research in advancing the use of AI in essay grading.\n","PeriodicalId":505245,"journal":{"name":"British Journal of Educational Technology","volume":"6 12","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"British Journal of Educational Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1111/bjet.13494","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This study investigates the validity and reliability of generative large language models (LLMs), specifically ChatGPT and Google's Bard, in grading student essays in higher education based on an analytical grading rubric. A total of 15 experienced English as a foreign language (EFL) instructors and two LLMs were asked to evaluate three student essays of varying quality. The grading scale comprised five domains: grammar, content, organization, style & expression, and mechanics. The results revealed that the fine-tuned ChatGPT model demonstrated a very high level of reliability with an intraclass correlation (ICC) score of 0.972, the default ChatGPT model exhibited an ICC score of 0.947, and Bard showed a substantial level of reliability with an ICC score of 0.919. Additionally, a significant overlap was observed in certain domains when comparing the grades assigned by LLMs and human raters. In conclusion, the findings suggest that while LLMs demonstrated notable consistency and potential for grading competency, further fine-tuning and adjustment are needed for a more nuanced understanding of non-objective essay criteria. The study not only offers insights into the potential use of LLMs in grading student essays but also highlights the need for continued development and research.

What is already known about this topic
- Large language models (LLMs), such as OpenAI's ChatGPT and Google's Bard, are known for their ability to generate text that mimics human-like conversation and writing.
- LLMs can perform various tasks, including essay grading.
- Intraclass correlation (ICC) is a statistical measure used to assess the reliability of ratings given by different raters (in this case, EFL instructors and LLMs).

What this paper adds
- The study makes a unique contribution by directly comparing the grading performance of expert EFL instructors with two LLMs (ChatGPT and Bard) using an analytical grading scale.
- It provides robust empirical evidence showing high reliability of LLMs in grading essays, supported by high ICC scores.
- It specifically highlights that the overall efficacy of LLMs extends to certain domains of essay grading.

Implications for practice and/or policy
- The findings open up potential new avenues for utilizing LLMs in academic settings, particularly for grading student essays, thereby possibly alleviating the workload of educators.
- The paper's insistence on the need for further fine-tuning of LLMs underlines the continual interplay between technological advancement and its practical applications.
- The results lay the groundwork for future research on the use of AI in essay grading.
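The reliability figures above are intraclass correlation coefficients computed across raters. As a rough illustration of how such a coefficient can be obtained from a table of ratings, the sketch below uses the pingouin library on invented scores; the essay IDs, rater labels, and numbers are hypothetical and not taken from the study.

```python
# Illustrative sketch (not the authors' analysis): computing intraclass
# correlation coefficients (ICC) from a long-format table of essay ratings.
# All scores below are made up for demonstration only.
import pandas as pd
import pingouin as pg  # pip install pingouin

# Long format: one row per (essay, rater) pair.
ratings = pd.DataFrame({
    "essay": ["E1", "E2", "E3"] * 4,
    "rater": ["Instructor1"] * 3 + ["Instructor2"] * 3
           + ["ChatGPT"] * 3 + ["Bard"] * 3,
    "score": [78, 55, 90,  80, 52, 88,  76, 58, 92,  74, 50, 85],
})

# pingouin reports single-rater (ICC1, ICC2, ICC3) and average-rater
# (ICC1k, ICC2k, ICC3k) forms together with 95% confidence intervals.
icc = pg.intraclass_corr(data=ratings, targets="essay",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```

Which ICC form corresponds to the values reported in the abstract is not specified here, so the output is meant only to show the mechanics of the calculation, not to reproduce the study's results.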