{"title":"使用 GPT-3.5 对物理概念性问题的书面回答进行人类水平部分学分评分,仅使用提示工程学","authors":"Zhongzhou Chen, Tong Wan","doi":"arxiv-2407.15251","DOIUrl":null,"url":null,"abstract":"Large language modules (LLMs) have great potential for auto-grading student\nwritten responses to physics problems due to their capacity to process and\ngenerate natural language. In this explorative study, we use a prompt\nengineering technique, which we name \"scaffolded chain of thought (COT)\", to\ninstruct GPT-3.5 to grade student written responses to a physics conceptual\nquestion. Compared to common COT prompting, scaffolded COT prompts GPT-3.5 to\nexplicitly compare student responses to a detailed, well-explained rubric\nbefore generating the grading outcome. We show that when compared to human\nraters, the grading accuracy of GPT-3.5 using scaffolded COT is 20% - 30%\nhigher than conventional COT. The level of agreement between AI and human\nraters can reach 70% - 80%, comparable to the level between two human raters.\nThis shows promise that an LLM-based AI grader can achieve human-level grading\naccuracy on a physics conceptual problem using prompt engineering techniques\nalone.","PeriodicalId":501565,"journal":{"name":"arXiv - PHYS - Physics Education","volume":"245 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Achieving Human Level Partial Credit Grading of Written Responses to Physics Conceptual Question using GPT-3.5 with Only Prompt Engineering\",\"authors\":\"Zhongzhou Chen, Tong Wan\",\"doi\":\"arxiv-2407.15251\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large language modules (LLMs) have great potential for auto-grading student\\nwritten responses to physics problems due to their capacity to process and\\ngenerate natural language. In this explorative study, we use a prompt\\nengineering technique, which we name \\\"scaffolded chain of thought (COT)\\\", to\\ninstruct GPT-3.5 to grade student written responses to a physics conceptual\\nquestion. Compared to common COT prompting, scaffolded COT prompts GPT-3.5 to\\nexplicitly compare student responses to a detailed, well-explained rubric\\nbefore generating the grading outcome. We show that when compared to human\\nraters, the grading accuracy of GPT-3.5 using scaffolded COT is 20% - 30%\\nhigher than conventional COT. 
The level of agreement between AI and human\\nraters can reach 70% - 80%, comparable to the level between two human raters.\\nThis shows promise that an LLM-based AI grader can achieve human-level grading\\naccuracy on a physics conceptual problem using prompt engineering techniques\\nalone.\",\"PeriodicalId\":501565,\"journal\":{\"name\":\"arXiv - PHYS - Physics Education\",\"volume\":\"245 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - PHYS - Physics Education\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2407.15251\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - Physics Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.15251","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Achieving Human Level Partial Credit Grading of Written Responses to Physics Conceptual Question using GPT-3.5 with Only Prompt Engineering
Large language models (LLMs) have great potential for auto-grading students' written responses to physics problems due to their capacity to process and generate natural language. In this exploratory study, we use a prompt engineering technique, which we name "scaffolded chain of thought (COT)", to instruct GPT-3.5 to grade student written responses to a physics conceptual question. Compared to common COT prompting, scaffolded COT prompts GPT-3.5 to explicitly compare student responses to a detailed, well-explained rubric before generating the grading outcome. We show that, when judged against human raters, the grading accuracy of GPT-3.5 using scaffolded COT is 20%-30% higher than with conventional COT. The level of agreement between AI and human raters can reach 70%-80%, comparable to the level of agreement between two human raters. This suggests that an LLM-based AI grader can achieve human-level grading accuracy on a physics conceptual problem using prompt engineering techniques alone.
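
The abstract does not reproduce the authors' actual prompt, but the idea of scaffolded COT grading can be illustrated with a short sketch using the openai Python SDK. Everything in this sketch is an assumption for illustration: the question, the rubric items, the prompt wording, and the helper name grade_response are hypothetical and are not taken from the paper.

# A minimal sketch of scaffolded chain-of-thought grading, assuming the
# openai Python SDK (>= 1.0). The rubric, question, and prompt wording
# below are illustrative placeholders, not the authors' published prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = "A ball is thrown straight up. What is its acceleration at the highest point?"
RUBRIC = (
    "Item 1 (1 pt): States that the acceleration at the top is not zero.\n"
    "Item 2 (1 pt): Identifies the acceleration as g (about 9.8 m/s^2), directed downward.\n"
    "Item 3 (1 pt): Explains that it is the velocity, not the acceleration, that is momentarily zero."
)

def grade_response(student_response: str) -> str:
    """Ask GPT-3.5 to walk through each rubric item before reporting a score."""
    prompt = (
        f"Question:\n{QUESTION}\n\n"
        f"Rubric:\n{RUBRIC}\n\n"
        f"Student response:\n{student_response}\n\n"
        "For each rubric item in order, first quote the part of the student response "
        "that is relevant (or say 'none'), then state whether the item is satisfied and why. "
        "Only after going through every item, report the total as 'Score: X/3'."
    )
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # reduce run-to-run variation in grading
        messages=[
            {"role": "system", "content": "You are a careful physics grader."},
            {"role": "user", "content": prompt},
        ],
    )
    return completion.choices[0].message.content

if __name__ == "__main__":
    print(grade_response("The acceleration is zero because the ball stops moving at the top."))

A conventional COT prompt, by contrast, would typically just ask the model to reason step by step and output a score; the scaffolding above forces an explicit item-by-item comparison against the rubric before the grading outcome is produced, which is the distinction the abstract draws.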