Grading exams using large language models: A comparison between human and AI grading of exams in higher education using ChatGPT

IF 3.0 · CAS Division 3 (Education) · JCR Q1 EDUCATION & EDUCATIONAL RESEARCH
Jonas Flodén
DOI: 10.1002/berj.4069
Journal: British Educational Research Journal, 51(1), pp. 201–224
Published: 2024-09-16 (Journal Article)
Article: https://onlinelibrary.wiley.com/doi/10.1002/berj.4069
Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/berj.4069
Citations: 0

Abstract


Grading exams using large language models: A comparison between human and AI grading of exams in higher education using ChatGPT

This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT performs when grading university exams relative to human teachers. Aspects investigated include consistency, large discrepancies and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three Master's-level exams were scored using ChatGPT 3.5, the results were compared with the teachers' scoring, and the grading teachers were interviewed. In total, 463 exam responses were graded. With each response graded at least three times, a total of 1,389 gradings were conducted. For the final exam scores, 70% of ChatGPT's gradings were within 10% of the teachers' gradings and 31% within 5%. ChatGPT tended to give marginally higher scores. Agreement on grades was 30%, but a further 45% of the exams received an adjacent grade. On individual questions, ChatGPT was more inclined to avoid very high or very low scores. ChatGPT struggled to score questions closely related to the course lectures correctly but performed better on more general questions. The AI can generate plausible scores on university exams that, at first glance, look similar to those of a human grader. There are differences, but it is not unlikely that two different human graders could produce similar discrepancies. During the interviews, teachers expressed surprise at how well ChatGPT's grading matched their own. Increased use of AI can pose ethical challenges, as exams are entrusted to a machine whose decision-making criteria are not fully understood, especially concerning potential bias in the training data.

Source journal: British Educational Research Journal
CiteScore: 4.70
Self-citation rate: 8.70%
Articles per year: 71
Journal description: The British Educational Research Journal is an international peer reviewed medium for the publication of articles of interest to researchers in education and has rapidly become a major focal point for the publication of educational research from throughout the world. For further information on the association please visit the British Educational Research Association web site. The journal is interdisciplinary in approach, and includes reports of case studies, experiments and surveys, discussions of conceptual and methodological issues and of underlying assumptions in educational research, accounts of research in progress, and book reviews.