Performance of GPT-4o and o1-Pro on United Kingdom Medical Licensing Assessment-style items: a comparative study.

IF 3.7 Q1 EDUCATION, SCIENTIFIC DISCIPLINES
Behrad Vakili, Aadam Ahmad, Mahsa Zolfaghari
{"title":"Performance of GPT-4o and o1-Pro on United Kingdom Medical Licensing Assessment-style items: a comparative study.","authors":"Behrad Vakili, Aadam Ahmad, Mahsa Zolfaghari","doi":"10.3352/jeehp.2025.22.30","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Large language models (LLMs) such as ChatGPT, and their potential to support autonomous learning for licensing exams like the UK Medical Licensing Assessment (UKMLA), are of growing interest. However, empirical evaluations of artificial intelligence (AI) performance against the UKMLA standard remain limited.</p><p><strong>Methods: </strong>We evaluated the performance of 2 recent ChatGPT versions, GPT-4o and o1-Pro, on a curated set of 374 UKMLA-style single-best-answer items spanning diverse medical specialties. Statistical comparisons using McNemar's test assessed the significance of differences between the 2 models. Specialties were analyzed to identify domain-specific variation. In addition, 20 image-based items were evaluated.</p><p><strong>Results: </strong>GPT-4o achieved an accuracy of 88.8%, while o1-Pro achieved 93.0%. McNemar's test revealed a statistically significant difference in favor of o1-Pro. Across specialties, both models demonstrated excellent performance in surgery, psychiatry, and infectious diseases. Notable differences arose in dermatology, respiratory medicine, and imaging, where o1-Pro consistently outperformed GPT-4o. Nevertheless, isolated weaknesses in general practice were observed. The analysis of image-based items showed 75% accuracy for GPT-4o and 90% for o1-Pro (P=0.25).</p><p><strong>Conclusion: </strong>ChatGPT shows strong potential as an adjunct learning tool for UKMLA preparation, with both models achieving scores above the calculated pass mark. This underscores the promise of advanced AI models in medical education. However, specialty-specific inconsistencies suggest AI tools should complement, rather than replace, traditional study methods.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"30"},"PeriodicalIF":3.7000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Educational Evaluation for Health Professions","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3352/jeehp.2025.22.30","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/10/10 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Citations: 0

Abstract

Purpose: Large language models (LLMs) such as ChatGPT, and their potential to support autonomous learning for licensing exams like the UK Medical Licensing Assessment (UKMLA), are of growing interest. However, empirical evaluations of artificial intelligence (AI) performance against the UKMLA standard remain limited.

Methods: We evaluated the performance of 2 recent ChatGPT versions, GPT-4o and o1-Pro, on a curated set of 374 UKMLA-style single-best-answer items spanning diverse medical specialties. Statistical comparisons using McNemar's test assessed the significance of differences between the 2 models. Specialties were analyzed to identify domain-specific variation. In addition, 20 image-based items were evaluated.
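The abstract does not include analysis code. As a hedged illustration of the paired comparison described here, McNemar's test can be run on item-level correctness for the two models. The sketch below is a minimal example, assuming per-item correctness is available as 0/1 arrays (the variable names and the placeholder data are hypothetical, not the study's data) and using the exact-test implementation in statsmodels.

```python
# Minimal sketch of the paired comparison described in Methods.
# Assumes per-item correctness for each model as boolean arrays; the
# placeholder data below merely stands in for the real item-level results.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_items = 374
gpt4o_correct = rng.random(n_items) < 0.888   # hypothetical placeholder
o1pro_correct = rng.random(n_items) < 0.930   # hypothetical placeholder

# 2x2 table of paired outcomes:
#                 o1-Pro correct                o1-Pro wrong
# GPT-4o correct  both correct                  only GPT-4o correct
# GPT-4o wrong    only o1-Pro correct           both wrong
table = np.array([
    [np.sum(gpt4o_correct & o1pro_correct),  np.sum(gpt4o_correct & ~o1pro_correct)],
    [np.sum(~gpt4o_correct & o1pro_correct), np.sum(~gpt4o_correct & ~o1pro_correct)],
])

# McNemar's test uses only the discordant cells (off-diagonal counts);
# exact=True applies the binomial version, appropriate when those counts are small.
result = mcnemar(table, exact=True)
print(table)
print(f"McNemar exact P = {result.pvalue:.4f}")
```

Note that the placeholder arrays are drawn independently, so they ignore the within-item pairing that the real analysis relies on; with the study's actual data, the same two lines building `table` and calling `mcnemar` would apply unchanged.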

Results: GPT-4o achieved an accuracy of 88.8%, while o1-Pro achieved 93.0%. McNemar's test revealed a statistically significant difference in favor of o1-Pro. Across specialties, both models demonstrated excellent performance in surgery, psychiatry, and infectious diseases. Notable differences arose in dermatology, respiratory medicine, and imaging, where o1-Pro consistently outperformed GPT-4o. Nevertheless, isolated weaknesses in general practice were observed. The analysis of image-based items showed 75% accuracy for GPT-4o and 90% for o1-Pro (P=0.25).
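The abstract does not report the discordant counts for the 20 image-based items. Purely as an illustration of why a 75% versus 90% split on 20 items is not statistically significant, the smallest discordant split consistent with 15/20 and 18/20 correct (3 items answered correctly only by o1-Pro, none only by GPT-4o) yields an exact two-sided McNemar P of 0.25, matching the reported value. The check below assumes that 3 vs 0 split; the true split is not given in the source.

```python
# Illustration only: the actual discordant counts for the image items are
# not reported. A 3 vs 0 split is the smallest one consistent with
# 15/20 (75%) vs 18/20 (90%) correct and is assumed here.
from scipy.stats import binomtest

only_o1pro, only_gpt4o = 3, 0            # assumed minimal discordant split
n_discordant = only_o1pro + only_gpt4o
p = binomtest(only_o1pro, n=n_discordant, p=0.5, alternative="two-sided").pvalue
print(p)  # 0.25, i.e. the exact McNemar P for this split
```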

Conclusion: ChatGPT shows strong potential as an adjunct learning tool for UKMLA preparation, with both models achieving scores above the calculated pass mark. This underscores the promise of advanced AI models in medical education. However, specialty-specific inconsistencies suggest AI tools should complement, rather than replace, traditional study methods.

Source journal
CiteScore: 9.60
Self-citation rate: 9.10%
Articles published: 32
Review time: 5 weeks
About the journal: The Journal of Educational Evaluation for Health Professions aims to provide readers with state-of-the-art, practical information on educational evaluation for the health professions, so as to improve the quality of undergraduate, graduate, and continuing education. It specializes in educational evaluation, including the application of measurement theory to health professions education, the promotion of high-stakes examinations such as national licensing examinations, the improvement of nationwide or international education programs, computer-based testing, computerized adaptive testing, and medical and health regulatory bodies. Its scope covers a variety of professions that address public health, including but not limited to: care workers, dental hygienists, dental technicians, dentists, dietitians, emergency medical technicians, health educators, medical record technicians, medical technologists, midwives, nurses, nursing aides, occupational therapists, opticians, oriental medical doctors, oriental medicine dispensers, oriental pharmacists, pharmacists, physical therapists, physicians, prosthetists and orthotists, radiological technologists, rehabilitation counselors, sanitary technicians, and speech-language therapists.