Can ChatGPT-4o Really Pass Medical Science Exams? A Pragmatic Analysis Using Novel Questions.

IF 1.8 Q2 EDUCATION, SCIENTIFIC DISCIPLINES
Medical Science Educator | Pub Date: 2025-02-04 | eCollection Date: 2025-04-01 | DOI: 10.1007/s40670-025-02293-z
Philip M Newton, Christopher J Summers, Uzman Zaheer, Maira Xiromeriti, Jemima R Stokes, Jaskaran Singh Bhangu, Elis G Roome, Alanna Roberts-Phillips, Darius Mazaheri-Asadi, Cameron D Jones, Stuart Hughes, Dominic Gilbert, Ewan Jones, Keioni Essex, Emily C Ellis, Ross Davey, Adrienne A Cox, Jessica A Bassett
{"title":"chatgpt - 40真的能通过医学考试吗?用新颖疑问句进行语用分析。","authors":"Philip M Newton, Christopher J Summers, Uzman Zaheer, Maira Xiromeriti, Jemima R Stokes, Jaskaran Singh Bhangu, Elis G Roome, Alanna Roberts-Phillips, Darius Mazaheri-Asadi, Cameron D Jones, Stuart Hughes, Dominic Gilbert, Ewan Jones, Keioni Essex, Emily C Ellis, Ross Davey, Adrienne A Cox, Jessica A Bassett","doi":"10.1007/s40670-025-02293-z","DOIUrl":null,"url":null,"abstract":"<p><p>ChatGPT apparently shows excellent performance on high-level professional exams such as those involved in medical assessment and licensing. This has raised concerns that ChatGPT could be used for academic misconduct, especially in unproctored online exams. However, ChatGPT has previously shown weaker performance on questions with pictures, and there have been concerns that ChatGPT's performance may be artificially inflated by the public nature of the sample questions tested, meaning they likely formed part of the training materials for ChatGPT. This led to suggestions that cheating could be mitigated by using novel questions for every sitting of an exam and making extensive use of picture-based questions. These approaches remain untested. Here, we tested the performance of ChatGPT-4o on existing medical licensing exams in the UK and USA, and on novel questions based on those exams. ChatGPT-4o scored 94% on the United Kingdom Medical Licensing Exam Applied Knowledge Test and 89.9% on the United States Medical Licensing Exam Step 1. Performance was not diminished when the questions were rewritten into novel versions, or on completely novel questions which were not based on any existing questions. ChatGPT did show reduced performance on questions containing images when the answer options were added to an image as text labels. These data demonstrate that the performance of ChatGPT continues to improve and that secure testing environments are required for the valid assessment of both foundational and higher order learning.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1007/s40670-025-02293-z.</p>","PeriodicalId":37113,"journal":{"name":"Medical Science Educator","volume":"35 2","pages":"721-729"},"PeriodicalIF":1.8000,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12058600/pdf/","citationCount":"0","resultStr":"{\"title\":\"Can ChatGPT-4o Really Pass Medical Science Exams? A Pragmatic Analysis Using Novel Questions.\",\"authors\":\"Philip M Newton, Christopher J Summers, Uzman Zaheer, Maira Xiromeriti, Jemima R Stokes, Jaskaran Singh Bhangu, Elis G Roome, Alanna Roberts-Phillips, Darius Mazaheri-Asadi, Cameron D Jones, Stuart Hughes, Dominic Gilbert, Ewan Jones, Keioni Essex, Emily C Ellis, Ross Davey, Adrienne A Cox, Jessica A Bassett\",\"doi\":\"10.1007/s40670-025-02293-z\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>ChatGPT apparently shows excellent performance on high-level professional exams such as those involved in medical assessment and licensing. This has raised concerns that ChatGPT could be used for academic misconduct, especially in unproctored online exams. 
However, ChatGPT has previously shown weaker performance on questions with pictures, and there have been concerns that ChatGPT's performance may be artificially inflated by the public nature of the sample questions tested, meaning they likely formed part of the training materials for ChatGPT. This led to suggestions that cheating could be mitigated by using novel questions for every sitting of an exam and making extensive use of picture-based questions. These approaches remain untested. Here, we tested the performance of ChatGPT-4o on existing medical licensing exams in the UK and USA, and on novel questions based on those exams. ChatGPT-4o scored 94% on the United Kingdom Medical Licensing Exam Applied Knowledge Test and 89.9% on the United States Medical Licensing Exam Step 1. Performance was not diminished when the questions were rewritten into novel versions, or on completely novel questions which were not based on any existing questions. ChatGPT did show reduced performance on questions containing images when the answer options were added to an image as text labels. These data demonstrate that the performance of ChatGPT continues to improve and that secure testing environments are required for the valid assessment of both foundational and higher order learning.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1007/s40670-025-02293-z.</p>\",\"PeriodicalId\":37113,\"journal\":{\"name\":\"Medical Science Educator\",\"volume\":\"35 2\",\"pages\":\"721-729\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2025-02-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12058600/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Medical Science Educator\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s40670-025-02293-z\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/4/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"EDUCATION, SCIENTIFIC DISCIPLINES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical Science Educator","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s40670-025-02293-z","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/4/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Citations: 0

Abstract


ChatGPT apparently shows excellent performance on high-level professional exams such as those involved in medical assessment and licensing. This has raised concerns that ChatGPT could be used for academic misconduct, especially in unproctored online exams. However, ChatGPT has previously shown weaker performance on questions with pictures, and there have been concerns that ChatGPT's performance may be artificially inflated by the public nature of the sample questions tested, meaning they likely formed part of the training materials for ChatGPT. This led to suggestions that cheating could be mitigated by using novel questions for every sitting of an exam and making extensive use of picture-based questions. These approaches remain untested. Here, we tested the performance of ChatGPT-4o on existing medical licensing exams in the UK and USA, and on novel questions based on those exams. ChatGPT-4o scored 94% on the United Kingdom Medical Licensing Exam Applied Knowledge Test and 89.9% on the United States Medical Licensing Exam Step 1. Performance was not diminished when the questions were rewritten into novel versions, or on completely novel questions which were not based on any existing questions. ChatGPT did show reduced performance on questions containing images when the answer options were added to an image as text labels. These data demonstrate that the performance of ChatGPT continues to improve and that secure testing environments are required for the valid assessment of both foundational and higher order learning.

Supplementary information: The online version contains supplementary material available at 10.1007/s40670-025-02293-z.
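The abstract reports headline accuracies (94% on the UKMLA Applied Knowledge Test, 89.9% on USMLE Step 1) obtained by presenting exam questions to ChatGPT-4o and scoring its answers. The paper does not describe its tooling, so the following is only an illustrative sketch of how such single-best-answer scoring could be automated: it assumes the official `openai` Python client and the `gpt-4o` model name, and uses a made-up example item rather than any real licensing-exam question.

```python
# Hypothetical sketch: scoring a chat model on single-best-answer MCQs.
# Assumes the `openai` Python client (>= 1.0) with OPENAI_API_KEY set in the
# environment; the study itself does not state which interface it used.
from openai import OpenAI

client = OpenAI()

questions = [
    # Illustrative item only, not taken from any licensing exam.
    {
        "stem": "Which vitamin deficiency causes scurvy?",
        "options": {"A": "Vitamin A", "B": "Vitamin B12",
                    "C": "Vitamin C", "D": "Vitamin D"},
        "answer": "C",
    },
]

def ask(item):
    """Present one question and return the model's single-letter choice."""
    prompt = (
        item["stem"] + "\n"
        + "\n".join(f"{k}. {v}" for k, v in item["options"].items())
        + "\nAnswer with the single letter of the best option."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # Take the first character of the reply as the chosen option letter.
    return resp.choices[0].message.content.strip()[:1].upper()

correct = sum(ask(q) == q["answer"] for q in questions)
print(f"Accuracy: {correct / len(questions):.1%}")
```

Checking the image-based finding (reduced performance when answer options are embedded in the picture as text labels) would additionally require sending the image itself as model input, which the same API supports but is beyond this sketch.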

Source Journal
Medical Science Educator (Social Sciences – Education)
CiteScore: 2.90 | Self-citation rate: 11.80% | Articles: 202
Journal description: Medical Science Educator is the successor of the journal JIAMSE. It is the peer-reviewed publication of the International Association of Medical Science Educators (IAMSE). The Journal offers all who teach in healthcare the most current information to succeed in their task by publishing scholarly activities, opinions, and resources in medical science education. Published articles focus on teaching the sciences fundamental to modern medicine and health, and include basic science education, clinical teaching, and the use of modern education technologies. The Journal provides the readership a better understanding of teaching and learning techniques in order to advance medical science education.