{"title":"使用 GPT-3.5 对物理概念性问题的书面回答进行人类水平部分学分评分,仅使用提示工程学","authors":"Zhongzhou Chen, Tong Wan","doi":"arxiv-2407.15251","DOIUrl":null,"url":null,"abstract":"Large language modules (LLMs) have great potential for auto-grading student\nwritten responses to physics problems due to their capacity to process and\ngenerate natural language. In this explorative study, we use a prompt\nengineering technique, which we name \"scaffolded chain of thought (COT)\", to\ninstruct GPT-3.5 to grade student written responses to a physics conceptual\nquestion. Compared to common COT prompting, scaffolded COT prompts GPT-3.5 to\nexplicitly compare student responses to a detailed, well-explained rubric\nbefore generating the grading outcome. We show that when compared to human\nraters, the grading accuracy of GPT-3.5 using scaffolded COT is 20% - 30%\nhigher than conventional COT. The level of agreement between AI and human\nraters can reach 70% - 80%, comparable to the level between two human raters.\nThis shows promise that an LLM-based AI grader can achieve human-level grading\naccuracy on a physics conceptual problem using prompt engineering techniques\nalone.","PeriodicalId":501565,"journal":{"name":"arXiv - PHYS - Physics Education","volume":"245 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Achieving Human Level Partial Credit Grading of Written Responses to Physics Conceptual Question using GPT-3.5 with Only Prompt Engineering\",\"authors\":\"Zhongzhou Chen, Tong Wan\",\"doi\":\"arxiv-2407.15251\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large language modules (LLMs) have great potential for auto-grading student\\nwritten responses to physics problems due to their capacity to process and\\ngenerate natural language. In this explorative study, we use a prompt\\nengineering technique, which we name \\\"scaffolded chain of thought (COT)\\\", to\\ninstruct GPT-3.5 to grade student written responses to a physics conceptual\\nquestion. Compared to common COT prompting, scaffolded COT prompts GPT-3.5 to\\nexplicitly compare student responses to a detailed, well-explained rubric\\nbefore generating the grading outcome. We show that when compared to human\\nraters, the grading accuracy of GPT-3.5 using scaffolded COT is 20% - 30%\\nhigher than conventional COT. 
The level of agreement between AI and human\\nraters can reach 70% - 80%, comparable to the level between two human raters.\\nThis shows promise that an LLM-based AI grader can achieve human-level grading\\naccuracy on a physics conceptual problem using prompt engineering techniques\\nalone.\",\"PeriodicalId\":501565,\"journal\":{\"name\":\"arXiv - PHYS - Physics Education\",\"volume\":\"245 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - PHYS - Physics Education\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2407.15251\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - Physics Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.15251","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Achieving Human Level Partial Credit Grading of Written Responses to Physics Conceptual Question using GPT-3.5 with Only Prompt Engineering
Large language models (LLMs) have great potential for auto-grading students' written responses to physics problems due to their capacity to process and generate natural language. In this exploratory study, we use a prompt engineering technique, which we name "scaffolded chain of thought (COT)", to instruct GPT-3.5 to grade student written responses to a physics conceptual question. Compared to common COT prompting, scaffolded COT prompts GPT-3.5 to explicitly compare student responses to a detailed, well-explained rubric before generating the grading outcome. We show that, when judged against human raters, the grading accuracy of GPT-3.5 using scaffolded COT is 20%-30% higher than with conventional COT. The level of agreement between AI and human raters can reach 70%-80%, comparable to the level of agreement between two human raters. This suggests that an LLM-based AI grader can achieve human-level grading accuracy on a physics conceptual problem using prompt engineering techniques alone.
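
The abstract does not reproduce the authors' actual prompt, but the idea of scaffolded COT grading can be illustrated with a short sketch using the openai Python SDK. Everything in this sketch is an assumption for illustration: the question, the rubric items, the prompt wording, and the helper name grade_response are hypothetical and are not taken from the paper.

# A minimal sketch of scaffolded chain-of-thought grading, assuming the
# openai Python SDK (>= 1.0). The rubric, question, and prompt wording
# below are illustrative placeholders, not the authors' published prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = "A ball is thrown straight up. What is its acceleration at the highest point?"
RUBRIC = (
    "Item 1 (1 pt): States that the acceleration at the top is not zero.\n"
    "Item 2 (1 pt): Identifies the acceleration as g (about 9.8 m/s^2), directed downward.\n"
    "Item 3 (1 pt): Explains that it is the velocity, not the acceleration, that is momentarily zero."
)

def grade_response(student_response: str) -> str:
    """Ask GPT-3.5 to walk through each rubric item before reporting a score."""
    prompt = (
        f"Question:\n{QUESTION}\n\n"
        f"Rubric:\n{RUBRIC}\n\n"
        f"Student response:\n{student_response}\n\n"
        "For each rubric item in order, first quote the part of the student response "
        "that is relevant (or say 'none'), then state whether the item is satisfied and why. "
        "Only after going through every item, report the total as 'Score: X/3'."
    )
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # reduce run-to-run variation in grading
        messages=[
            {"role": "system", "content": "You are a careful physics grader."},
            {"role": "user", "content": prompt},
        ],
    )
    return completion.choices[0].message.content

if __name__ == "__main__":
    print(grade_response("The acceleration is zero because the ball stops moving at the top."))

A conventional COT prompt, by contrast, would typically just ask the model to reason step by step and output a score; the scaffolding above forces an explicit item-by-item comparison against the rubric before the grading outcome is produced, which is the distinction the abstract draws.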