Can ChatGPT-4o really pass medical science exams? A pragmatic analysis using novel questions.

medRxiv - Medical Education. Pub Date: 2024-07-02. DOI: 10.1101/2024.06.29.24309595

Phil Newton, Chris J Summers, Uzman Zaheer, Maira Xiromeriti, Jemima R Stokes, Jaskaran Singh Bhangu, Elis G Roome, Alanna Roberts-Phillips, Darius Mazaheri-Asadi, Cameron D Jones, Stuart Hughes, Dominic Gilbert, Ewan Jones, Keioni Essex, Emily C Rees, Ross Davey, Adrienne A Cox, Jessica A Bassett


Citations: 0

Abstract

ChatGPT apparently shows excellent performance on high-level professional exams such as those involved in medical assessment and licensing. This has raised concerns that ChatGPT could be used for academic misconduct, especially in unproctored online exams. However, ChatGPT has also shown weaker performance on questions with pictures, and there have been concerns that ChatGPT's performance may be artificially inflated by the public nature of the sample questions tested, meaning they likely formed part of the training materials for ChatGPT. This led to suggestions that cheating could be mitigated by using novel questions for every sitting of an exam and making extensive use of picture-based questions. These approaches remain untested. Here we tested the performance of ChatGPT-4o on existing medical licensing exams in the UK and USA, and on novel questions based on those exams. ChatGPT-4o scored 94% on the United Kingdom Medical Licensing Exam Applied Knowledge Test, and 89.9% on the United States Medical Licensing Exam Step 1. Performance was not diminished when the questions were rewritten into novel versions, or on completely novel questions which were not based on any existing questions. ChatGPT did show slightly reduced performance on questions containing images, particularly when the answer options were added to an image as text labels. These data demonstrate that the performance of ChatGPT continues to improve and that online unproctored exams are an invalid form of assessment of the foundational knowledge needed for higher-order learning.

