Performance of a Large-Language Model in scoring construction management capstone design projects

Impact factor: 2.0; Zone 3 (Engineering Technology); JCR Q3 (Computer Science, Interdisciplinary Applications)

Computer Applications in Engineering Education; Pub Date: 2024-09-14; DOI: 10.1002/cae.22796; https://onlinelibrary.wiley.com/doi/10.1002/cae.22796

Gabriel Castelblanco, Laura Cruz-Castro, Zhenlin Yang


Citations: 0

Abstract



Grading is one of the most significant burdens for instructors, diverting their focus from developing engaging learning activities, preparing classes, and attending to students' questions. Institutions and instructors continuously look for ways to reduce the time educators spend on grading, which frequently results in hiring teaching assistants whose inexperience and frequent rotation can lead to inconsistent and subjective evaluations. Large Language Models (LLMs) like GPT-4 may alleviate grading challenges; however, research in this field is limited for assignments that are subjective and creative and that require specialized knowledge and complex critical thinking. This research investigates whether GPT-4's scores correlate with human grading in a construction capstone project and how the use of criteria and rubrics in GPT influences this correlation. Projects were graded by two human graders and by GPT-4 under three training configurations: no detailed criteria, paraphrased criteria, and explicit rubrics. Each configuration was run for 10 iterations to evaluate GPT's consistency. The results challenge GPT-4's potential to grade argumentative assignments. GPT-4's scores correlate slightly better (although poorly overall) with human evaluations when no additional information is provided, underscoring the limited impact of the specificity of training materials on GPT scoring in this type of assignment. Despite the promise of LLMs, their limitations include variable consistency and reliance on statistical pattern analysis, which can lead to misleading evaluations, along with privacy concerns when handling sensitive student data. Educators must carefully design grading guidelines to harness the full potential of LLMs in academic assessments, balancing AI's efficiency with the need for nuanced human judgment.
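
The comparison described in the abstract (repeated GPT-4 scores per project, set against the scores of two human graders) can be sketched as follows. This is a minimal illustration and not the authors' analysis: the abstract does not say which correlation coefficient or consistency measure was used, so Pearson's r and the per-project standard deviation across runs are assumptions here, the function names are hypothetical, and every score below is a made-up placeholder.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def summarize(human_scores, llm_runs):
    """Compare LLM scores with human scores for one prompt configuration.

    human_scores: one (grader_a, grader_b) pair per project.
    llm_runs: one list of repeated LLM scores per project.
    Returns (correlation of per-project means, mean spread across runs).
    """
    human_mean = [mean(pair) for pair in human_scores]
    llm_mean = [mean(runs) for runs in llm_runs]
    # Consistency proxy (assumed): average per-project standard deviation.
    spread = mean(pstdev(runs) for runs in llm_runs)
    return pearson_r(human_mean, llm_mean), spread


# Placeholder data (hypothetical, NOT from the paper): 4 projects,
# 2 human graders each, 3 repeated LLM runs each (the study used 10).
human = [(80, 84), (70, 74), (90, 92), (60, 66)]
llm = [[78, 82, 80], [75, 77, 73], [85, 85, 85], [70, 74, 72]]

r, spread = summarize(human, llm)
print(f"correlation with human mean: {r:.3f}")
print(f"mean spread across runs: {spread:.3f}")
```

Calling `summarize` once per configuration (no detailed criteria, paraphrased criteria, explicit rubrics) would give one correlation and one spread value for each, which is the kind of side-by-side comparison the abstract reports.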


Source journal

Computer Applications in Engineering Education (Engineering Technology; Engineering: Multidisciplinary)

CiteScore: 7.20

Self-citation rate: 10.30%

Articles published: 100

Review time: 6-12 weeks

About the journal: Computer Applications in Engineering Education provides a forum for publishing timely, peer-reviewed information on the innovative uses of computers, the Internet, and software tools in engineering education. Besides new courses and software tools, the CAE journal covers areas that support the integration of technology-based modules in the engineering curriculum and promotes discussion of the assessment and dissemination issues associated with these new implementation methods.