{"title":"Performance of GPT-4o and o1-Pro on United Kingdom Medical Licensing Assessment-style items: a comparative study.","authors":"Behrad Vakili, Aadam Ahmad, Mahsa Zolfaghari","doi":"10.3352/jeehp.2025.22.30","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Large language models (LLMs) such as ChatGPT, and their potential to support autonomous learning for licensing exams like the UK Medical Licensing Assessment (UKMLA), are of growing interest. However, empirical evaluations of artificial intelligence (AI) performance against the UKMLA standard remain limited.</p><p><strong>Methods: </strong>We evaluated the performance of 2 recent ChatGPT versions, GPT-4o and o1-Pro, on a curated set of 374 UKMLA-style single-best-answer items spanning diverse medical specialties. Statistical comparisons using McNemar's test assessed the significance of differences between the 2 models. Specialties were analyzed to identify domain-specific variation. In addition, 20 image-based items were evaluated.</p><p><strong>Results: </strong>GPT-4o achieved an accuracy of 88.8%, while o1-Pro achieved 93.0%. McNemar's test revealed a statistically significant difference in favor of o1-Pro. Across specialties, both models demonstrated excellent performance in surgery, psychiatry, and infectious diseases. Notable differences arose in dermatology, respiratory medicine, and imaging, where o1-Pro consistently outperformed GPT-4o. Nevertheless, isolated weaknesses in general practice were observed. The analysis of image-based items showed 75% accuracy for GPT-4o and 90% for o1-Pro (P=0.25).</p><p><strong>Conclusion: </strong>ChatGPT shows strong potential as an adjunct learning tool for UKMLA preparation, with both models achieving scores above the calculated pass mark. This underscores the promise of advanced AI models in medical education. However, specialty-specific inconsistencies suggest AI tools should complement, rather than replace, traditional study methods.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"30"},"PeriodicalIF":3.7000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Educational Evaluation for Health Professions","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3352/jeehp.2025.22.30","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/10/10 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Citations: 0
Abstract
Purpose: Large language models (LLMs) such as ChatGPT are attracting growing interest for their potential to support autonomous learning for licensing examinations such as the UK Medical Licensing Assessment (UKMLA). However, empirical evaluations of artificial intelligence (AI) performance against the UKMLA standard remain limited.
Methods: We evaluated the performance of 2 recent ChatGPT versions, GPT-4o and o1-Pro, on a curated set of 374 UKMLA-style single-best-answer items spanning diverse medical specialties. Statistical comparisons using McNemar's test assessed the significance of differences between the 2 models. Specialties were analyzed to identify domain-specific variation. In addition, 20 image-based items were evaluated.
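A minimal sketch (not the authors' code) of the paired comparison described above: per-item correctness for the 2 models is cross-tabulated and assessed with an exact McNemar's test. The variable names and the simulated correctness vectors below are illustrative assumptions, not the study data.

```python
# Illustrative sketch of a paired McNemar comparison between two models.
# The data below are simulated placeholders, not the study's item-level results.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical per-item results (1 = correct, 0 = incorrect) for 374 items.
rng = np.random.default_rng(0)
gpt4o_correct = rng.binomial(1, 0.888, size=374)
o1pro_correct = rng.binomial(1, 0.930, size=374)

# Build the paired 2x2 table: rows = GPT-4o (correct/incorrect), columns = o1-Pro.
both = np.sum((gpt4o_correct == 1) & (o1pro_correct == 1))
only_4o = np.sum((gpt4o_correct == 1) & (o1pro_correct == 0))
only_o1 = np.sum((gpt4o_correct == 0) & (o1pro_correct == 1))
neither = np.sum((gpt4o_correct == 0) & (o1pro_correct == 0))
table = [[both, only_4o], [only_o1, neither]]

# The exact McNemar test uses only the discordant cells (only_4o vs. only_o1).
result = mcnemar(table, exact=True)
print(f"discordant pairs: {only_4o} vs {only_o1}, p = {result.pvalue:.3f}")
```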
Results: GPT-4o achieved an accuracy of 88.8%, while o1-Pro achieved 93.0%. McNemar's test revealed a statistically significant difference in favor of o1-Pro. Across specialties, both models demonstrated excellent performance in surgery, psychiatry, and infectious diseases. Notable differences arose in dermatology, respiratory medicine, and imaging, where o1-Pro consistently outperformed GPT-4o. Nevertheless, isolated weaknesses in general practice were observed. The analysis of image-based items showed 75% accuracy for GPT-4o and 90% for o1-Pro (P=0.25).
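As an illustrative consistency check (an assumption, not reported by the authors), the two-sided exact McNemar P-value of 0.25 for the 20 image-based items is what one would obtain if the 3-item gap (15 vs. 18 correct) came entirely from discordant pairs favoring o1-Pro, i.e. b = 3 and c = 0 discordant pairs:

$$
P = 2\sum_{k=0}^{\min(b,c)} \binom{b+c}{k}\left(\tfrac{1}{2}\right)^{b+c}
  = 2\left(\tfrac{1}{2}\right)^{3} = 0.25, \qquad b = 3,\ c = 0.
$$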
Conclusion: ChatGPT shows strong potential as an adjunct learning tool for UKMLA preparation, with both models achieving scores above the calculated pass mark. This underscores the promise of advanced AI models in medical education. However, specialty-specific inconsistencies suggest AI tools should complement, rather than replace, traditional study methods.
About the journal
The Journal of Educational Evaluation for Health Professions aims to provide readers with state-of-the-art, practical information on educational evaluation for the health professions in order to improve the quality of undergraduate, graduate, and continuing education. It specializes in educational evaluation, including the application of measurement theory to health professions education, the promotion of high-stakes examinations such as national licensing examinations, the improvement of nationwide or international education programs, computer-based testing, computerized adaptive testing, and health regulatory bodies. Its scope covers a variety of professions that serve public health, including but not limited to: care workers, dental hygienists, dental technicians, dentists, dietitians, emergency medical technicians, health educators, medical record technicians, medical technologists, midwives, nurses, nursing aides, occupational therapists, opticians, oriental medical doctors, oriental medicine dispensers, oriental pharmacists, pharmacists, physical therapists, physicians, prosthetists and orthotists, radiological technologists, rehabilitation counselors, sanitary technicians, and speech-language therapists.