While GPT-3.5 is unable to pass the Physician Licensing Exam in Taiwan, GPT-4 successfully meets the criteria.

Tsung-An Chen, Kuan-Chen Lin, Ming-Hwai Lin, Hsiao-Ting Chang, Yu-Chun Chen, Tzeng-Ji Chen
Journal of the Chinese Medical Association (JCMA). Journal Article. Published 2025-03-14. DOI: 10.1097/JCMA.0000000000001225 (https://doi.org/10.1097/JCMA.0000000000001225)

Abstract

Background: This study investigates the performance of ChatGPT-3.5 and ChatGPT-4 in answering medical questions from Taiwan's Physician Licensing Exam, ranging from basic medical knowledge to specialized clinical topics. It aims to understand these artificial intelligence (AI) models' capabilities in a non-English context, specifically traditional Chinese.

Methods: The study incorporated questions from the Taiwan Physician Licensing Exam in 2022, excluding image-based queries. Each question was manually input into ChatGPT, and responses were compared with official answers from Taiwan's Ministry of Examination. Differences across specialties and question types were assessed using the Kruskal-Wallis and Fisher's exact tests.

Results: ChatGPT-3.5 achieved an average accuracy of 67.7% in basic medical sciences and 53.2% in clinical medicine. ChatGPT-4 significantly outperformed ChatGPT-3.5, with average accuracies of 91.9% in basic medical sciences and 90.7% in clinical medicine. ChatGPT-3.5 scored above 60.0% in 7 of 10 basic medical science subjects and 3 of 14 clinical subjects, while ChatGPT-4 scored above 60.0% in every subject. The type of question did not significantly affect accuracy rates.

Conclusion: ChatGPT-3.5 showed proficiency in basic medical sciences but was less reliable in clinical medicine, whereas ChatGPT-4 demonstrated strong capabilities in both areas. However, the proficiency of both models varied across specialties. The type of question had minimal impact on performance. This study highlights the potential of AI models in medical education and in non-English-language examinations, and the need for cautious and informed implementation in educational settings because of the variability across specialties.
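
The Methods name Fisher's exact test as one of the tools used to assess differences in accuracy. As a rough illustration of what such a comparison involves, the sketch below computes a two-sided Fisher's exact p-value for a 2x2 table of correct and incorrect answers, using only the Python standard library. It is not the authors' analysis code: the function name `fisher_exact_two_sided` is introduced here for illustration, and the counts in the usage example are hypothetical and are not taken from the study.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows could be two groups (e.g. two models or two question types);
    columns are correct / incorrect answer counts.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)   # smallest feasible top-left count
    highest = min(row1, col1)      # largest feasible top-left count

    # Sum the probabilities of every table that is no more likely than
    # the observed one; the small relative tolerance guards against
    # floating-point ties being missed.
    p_value = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Hypothetical counts, NOT from the study: group A answers 54 of 100
# questions correctly, group B answers 91 of 100 correctly.
p = fisher_exact_two_sided(54, 46, 91, 9)
print(f"two-sided p-value: {p:.3g}")
```

A small p-value here would suggest that the two groups' accuracy rates differ by more than chance alone would explain. The Kruskal-Wallis test also mentioned in the Methods addresses a different question (differences across several groups based on ranks) and is not covered by this sketch.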
