Evaluating ChatGPT-4 Vision on Brazil’s National Undergraduate Computer Science Exam

Impact factor: 3.8 | Region 3, Engineering and Technology | JCR Q1: EDUCATION, SCIENTIFIC DISCIPLINES

ACM Transactions on Computing Education | Pub Date: 2024-06-20 | DOI: 10.1145/3674149

Nabor C. Mendonça


Citations: 0

Abstract


Evaluating ChatGPT-4 Vision on Brazil’s National Undergraduate Computer Science Exam

The recent integration of visual capabilities into Large Language Models (LLMs) has the potential to play a pivotal role in science and technology education, where visual elements such as diagrams, charts, and tables are commonly used to improve the learning experience. This study investigates the performance of ChatGPT-4 Vision, OpenAI’s most advanced visual model at the time the study was conducted, on the Bachelor in Computer Science section of Brazil’s 2021 National Undergraduate Exam (ENADE). By presenting the model with the exam’s open and multiple-choice questions in their original image format and allowing for reassessment in response to differing answer keys, we were able to evaluate the model’s reasoning and self-reflecting capabilities in a large-scale academic assessment involving textual and visual content. ChatGPT-4 Vision significantly outperformed the average exam participant, positioning itself within the top 10 best score percentile. While it excelled in questions that incorporated visual elements, it also encountered challenges with question interpretation, logical reasoning, and visual acuity. A positive correlation between the model’s performance in multiple-choice questions and the performance distribution of the human participants suggests multimodal LLMs can provide a useful tool for question testing and refinement. However, the involvement of an independent expert panel to review cases of disagreement between the model and the answer key revealed some poorly constructed questions containing vague or ambiguous statements, calling attention to the critical need for improved question design in future exams. Our findings suggest that while ChatGPT-4 Vision shows promise in multimodal academic evaluations, human oversight remains crucial for verifying the model’s accuracy and ensuring the fairness of high-stakes educational exams. The paper’s research materials are publicly available at https://github.com/nabormendonca/gpt-4v-enade-cs-2021.
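The abstract reports a positive correlation between the model's multiple-choice performance and the performance distribution of the human participants, but does not say which statistic was used. As a minimal sketch only, the snippet below assumes a Pearson correlation between a per-question 0/1 correctness flag for the model and the fraction of human participants who answered that question correctly. The function name `pearson`, the variable names, and all numbers are hypothetical illustrations, not data or code from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical illustration data, NOT taken from the paper:
# 1 = model answered the question correctly, 0 = it did not.
model_correct = [1, 1, 0, 1, 0, 1]
# Assumed fraction of human participants who answered each question correctly.
human_correct_rate = [0.72, 0.55, 0.18, 0.64, 0.25, 0.40]

r = pearson(model_correct, human_correct_rate)
print(f"r = {r:.2f}")
```

A value of `r` above zero would mean the model tends to succeed on the questions that more human participants also answered correctly. With one binary variable this is the same quantity as the point-biserial correlation, so no separate formula is needed for this sketch.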


Source journal

ACM Transactions on Computing Education | EDUCATION, SCIENTIFIC DISCIPLINES

CiteScore

6.50

Self-citation rate

16.70%

Articles published

About the journal: ACM Transactions on Computing Education (TOCE) (formerly named JERIC, Journal on Educational Resources in Computing) covers diverse aspects of computing education: traditional computer science, computer engineering, information technology, and informatics; emerging aspects of computing; and applications of computing to other disciplines. The common characteristics shared by these papers are a scholarly approach to teaching and learning, a broad appeal to educational practitioners, and a clear connection to student learning.