Achieving Human Level Partial Credit Grading of Written Responses to Physics Conceptual Question using GPT-3.5 with Only Prompt Engineering

Zhongzhou Chen, Tong Wan
arXiv - PHYS - Physics Education · arXiv:2407.15251 · Published 2024-07-21
Citations: 0

Abstract

Large language models (LLMs) have great potential for auto-grading students' written responses to physics problems due to their capacity to process and generate natural language. In this exploratory study, we use a prompt-engineering technique, which we name "scaffolded chain of thought (COT)", to instruct GPT-3.5 to grade student written responses to a physics conceptual question. Compared to common COT prompting, scaffolded COT prompts GPT-3.5 to explicitly compare student responses to a detailed, well-explained rubric before generating the grading outcome. We show that, when compared to human raters, the grading accuracy of GPT-3.5 using scaffolded COT is 20%-30% higher than with conventional COT. The level of agreement between AI and human raters can reach 70%-80%, comparable to the level of agreement between two human raters. This shows promise that an LLM-based AI grader can achieve human-level grading accuracy on a physics conceptual problem using prompt-engineering techniques alone.
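To make the idea concrete: a scaffolded-COT prompt pairs each rubric item with an explicit instruction to address that item before any score is produced, rather than simply asking the model to "think step by step." The function, question, and rubric below are a hypothetical sketch of this structure, not the authors' actual prompt or rubric:

```python
def build_scaffolded_cot_prompt(question, rubric_items, response):
    """Assemble a scaffolded chain-of-thought grading prompt.

    Unlike plain COT prompting, the scaffold forces the model to
    compare the response against EVERY rubric item explicitly
    before it is allowed to output a partial-credit score.
    """
    lines = [
        "You are grading a student's written answer to a physics conceptual question.",
        f"Question: {question}",
        f"Student response: {response}",
        "Before assigning any credit, compare the response to each rubric item below",
        "and state whether the response satisfies it, quoting the relevant text:",
    ]
    for i, item in enumerate(rubric_items, start=1):
        lines.append(f"  Rubric item {i}: {item}")
    lines.append("Only after addressing every item, output the partial-credit score.")
    return "\n".join(lines)

# Hypothetical rubric for illustration only
rubric = [
    "States that the net force on the puck is zero.",
    "Explains that zero net force implies constant velocity, not zero velocity.",
]
prompt = build_scaffolded_cot_prompt(
    "A puck slides at constant speed on frictionless ice. What is the net force on it?",
    rubric,
    "The net force is zero because the puck is not accelerating.",
)
```

The resulting string would then be sent to GPT-3.5 as the grading instruction; the key design choice is that the rubric comparison is demanded item by item in the prompt itself, before the grading outcome is generated.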