ChatGPT's Performance Evaluation in Spreadsheets Modelling to Inform Assessments Redesign

IF 4.6 · Zone 2 (Education) · JCR Q1, EDUCATION & EDUCATIONAL RESEARCH

Journal of Computer Assisted Learning · Pub Date: 2025-05-05 · DOI: 10.1111/jcal.70035

Michelle Cheong

Volume 41, Issue 3. Full text: https://onlinelibrary.wiley.com/doi/10.1111/jcal.70035


Abstract

Background

Students increasingly use ChatGPT to assist their learning and even to complete their assessments, raising concerns about academic integrity and the loss of critical thinking skills. Many articles have suggested that educators redesign assessments to be more ‘Generative-AI-resistant’ and focus on assessing higher-order thinking skills. However, few articles attempt to quantify assessments at different cognitive levels and provide empirical insights into ChatGPT's performance at each level, which will affect how educators redesign their assessments.

Objectives

Educators need new information on how well ChatGPT performs in order to redesign future assessments for this new paradigm. This paper attempts to fill the gap in empirical research by using spreadsheet modelling assessments, tested under four different prompt engineering settings, to provide new knowledge that supports assessment redesign. Our proposed methodology can be applied to other course modules so that educators can derive their own insights for future assessment design and actions.

Methods

We evaluated the performance of ChatGPT 3.5 in solving spreadsheet modelling assessment questions with multiple linked test items categorised according to the revised Bloom's taxonomy. We tested and compared accuracy under four prompt engineering settings, namely Zero-Shot-Baseline (ZSB), Zero-Shot-Chain-of-Thought (ZSCoT), One-Shot (OS), and One-Shot-Chain-of-Thought (OSCoT), to establish how well ChatGPT 3.5 tackled technical questions at different cognitive levels under each prompt setting, and which prompt setting was effective in enhancing ChatGPT's performance at each level.
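The four prompt settings named above can be illustrated with a minimal sketch. The abstract does not give the exact prompt wording, so the function name `build_prompt`, the labels "Example question"/"Example answer", and the chain-of-thought cue below are all assumptions made for illustration, not the study's actual prompts.

```python
from typing import Optional

# Commonly used zero-shot chain-of-thought cue; the study's exact wording
# is not stated in the abstract, so this phrase is an assumption.
COT_CUE = "Let's think step by step."

SETTINGS = {"ZSB", "ZSCoT", "OS", "OSCoT"}


def build_prompt(question: str, setting: str,
                 example_q: Optional[str] = None,
                 example_a: Optional[str] = None) -> str:
    """Assemble a prompt for one of the settings ZSB, ZSCoT, OS, OSCoT.

    Hypothetical helper: one-shot settings prepend a single worked example,
    chain-of-thought settings append a reasoning cue.
    """
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting}")
    parts = []
    if setting in {"OS", "OSCoT"}:
        if example_q is None or example_a is None:
            raise ValueError("one-shot settings need a worked example")
        parts.append(f"Example question: {example_q}")
        parts.append(f"Example answer: {example_a}")
    parts.append(f"Question: {question}")
    if setting in {"ZSCoT", "OSCoT"}:
        parts.append(COT_CUE)
    return "\n".join(parts)
```

In this sketch the only difference between settings is whether one worked example is supplied and whether a reasoning cue is appended; in practice a one-shot chain-of-thought example would usually also spell out its intermediate steps in `example_a`.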

Results

We found that ChatGPT 3.5 performed well up to Level 3 of the revised Bloom's taxonomy using ZSB, with accuracy decreasing as the cognitive level increased. From Level 4 onwards, it did not perform as well and made many mistakes. ZSCoT achieved modest improvements up to Level 5, making it a possible concern for instructors. OS achieved very significant improvements for Levels 3 and 4, while OSCoT was needed to achieve a very significant improvement for Level 5. None of the prompts tested improved response quality for Level 6.
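A comparison of this kind amounts to computing accuracy per prompt setting and Bloom level. The sketch below shows one way to tabulate it, assuming each graded test item is recorded as a `(setting, bloom_level, is_correct)` tuple; that record layout and the function name `accuracy_by_level` are hypothetical, and the abstract does not describe the study's actual scoring procedure.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Return {(setting, bloom_level): fraction of items answered correctly}.

    `records` is an iterable of (setting, bloom_level, is_correct) tuples;
    this layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for setting, level, is_correct in records:
        key = (setting, level)
        totals[key] += 1
        correct[key] += int(bool(is_correct))
    return {key: correct[key] / totals[key] for key in totals}


# Placeholder records for demonstration only; these are NOT data from the study.
demo = [("ZSB", 3, True), ("ZSB", 3, False), ("OS", 3, True)]
table = accuracy_by_level(demo)
```

Keying the result on the `(setting, level)` pair makes it straightforward to compare a prompt setting against the ZSB baseline at the same cognitive level.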

Conclusions

We conclude that educators must be cognizant of ChatGPT's performance on questions at different cognitive levels, and of the enhanced performance obtainable with suitable prompts. To develop students' critical thinking abilities, we provide four recommendations for assessment redesign, which aim to mitigate ChatGPT's negative impact on student learning and to leverage it to enhance learning, taking into account its performance at different cognitive levels.


Source journal

Journal of Computer Assisted Learning (EDUCATION & EDUCATIONAL RESEARCH)

CiteScore: 9.70

Self-citation rate: 6.00%

Articles published: 116

About the journal: The Journal of Computer Assisted Learning is an international peer-reviewed journal which covers the whole range of uses of information and communication technology to support learning and knowledge exchange. It aims to provide a medium for communication among researchers as well as a channel linking researchers, practitioners, and policy makers. JCAL is also a rich source of material for master's and PhD students in areas such as educational psychology, the learning sciences, instructional technology, instructional design, collaborative learning, intelligent learning systems, learning analytics, open, distance and networked learning, and educational evaluation and assessment. This is the case for formal (e.g., schools), non-formal (e.g., workplace learning) and informal learning (e.g., museums and libraries) situations and environments. Volumes often include one Special Issue, which provides readers with a broad and in-depth perspective on a specific topic. First published in 1985, JCAL continues to have the aim of making the outcomes of contemporary research and experience accessible. During this period there have been major technological advances offering new opportunities and approaches in the use of a wide range of technologies to support learning and knowledge transfer more generally. There is currently much emphasis on the use of network functionality and the challenges its appropriate uses pose to teachers/tutors working with students locally and at a distance.

JCAL welcomes:
- Empirical reports, single studies or programmatic series of studies on the use of computers and information technologies in learning and assessment
- Critical and original meta-reviews of literature on the use of computers for learning
- Empirical studies on the design and development of innovative technology-based systems for learning
- Conceptual articles on issues relating to the Aims and Scope