Can AI Outperform Human Aspirants? Evaluating 3 ChatGPT Models on the Spanish FIR and BIR Specialized Health Examinations.

IF 1.9 Q3 MEDICAL LABORATORY TECHNOLOGY
Gonzalo Verdú, Alejandro Calvera Rayo, Aleix B Fabregat-Bolufer
{"title":"Can AI Outperform Human Aspirants? Evaluating 3 ChatGPT Models on the Spanish FIR and BIR Specialized Health Examinations.","authors":"Gonzalo Verdú, Alejandro Calvera Rayo, Aleix B Fabregat-Bolufer","doi":"10.1093/jalm/jfaf098","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) models are increasingly used in academic and clinical settings that require information synthesis and decision-making. This study explores the performance, accuracy, and reproducibility of 3 OpenAI models-GPT-4o Mini, GPT-4o, and GPT-o1-when applied to the 2023 Spanish FIR (Pharmaceutical Internal Resident) and BIR (Biologist Internal Resident) exams. By assessing their capabilities on these highly specialized tests, we aim to evaluate their potential as reliable tools for academic preparation and clinical support.</p><p><strong>Methods: </strong>Each model was prompted with 200 questions from the 2023 FIR and BIR exams, respectively. The analysis evaluated overall accuracy, official exam scoring, and predicted ranking. Subanalyses focused on multimodal image-based questions and clinical cases. Reproducibility was assessed by retesting all questions from both exams using the Cohen Kappa and McNemar tests.</p><p><strong>Results: </strong>After the first attempt, GPT-o1 achieved the highest accuracy (92% on FIR, 97.0% on BIR), securing top positions in both exams. GPT-4o performed exceptionally (87% on FIR, 97.5% on BIR), surpassing all human candidates on BIR and ranking third on FIR. GPT-4o Mini, while strong (80.5% on FIR, 93.0% on BIR), struggled with complex or image-reliant questions. The reproducibility analysis showed GPT-o1's tendency to correct previous mistakes on retesting, while GPT-4o and GPT-4o Mini more consistently repeated initial answers.</p><p><strong>Conclusions: </strong>These models, particularly GPT-o1, outperformed human examinees, supporting AI integration in exam preparation and clinical training. However, limitations persist in multimodal understanding and specialized subdomains. Human oversight remains essential to ensure reliability in laboratory and clinical practice.</p>","PeriodicalId":46361,"journal":{"name":"Journal of Applied Laboratory Medicine","volume":" ","pages":"1215-1225"},"PeriodicalIF":1.9000,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Applied Laboratory Medicine","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/jalm/jfaf098","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MEDICAL LABORATORY TECHNOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Artificial intelligence (AI) models are increasingly used in academic and clinical settings that require information synthesis and decision-making. This study explores the performance, accuracy, and reproducibility of 3 OpenAI models (GPT-4o Mini, GPT-4o, and GPT-o1) when applied to the 2023 Spanish FIR (Pharmaceutical Internal Resident) and BIR (Biologist Internal Resident) exams. By assessing their capabilities on these highly specialized tests, we aim to evaluate their potential as reliable tools for academic preparation and clinical support.

Methods: Each model was prompted with the 200 questions from each of the 2023 FIR and BIR exams. The analysis evaluated overall accuracy, official exam score, and predicted ranking. Subanalyses focused on multimodal image-based questions and clinical cases. Reproducibility was assessed by retesting all questions from both exams and comparing runs with Cohen's kappa and the McNemar test.
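For readers who want to reproduce this kind of evaluation, the sketch below illustrates the scoring and reproducibility checks described above. It is not the authors' code: the +3/-1 scoring rule is the convention commonly cited for Spanish residency exams and is assumed here, the function names are hypothetical, and the kappa and McNemar computations rely on scikit-learn and statsmodels.

```python
# Minimal sketch of the evaluation pipeline, not the authors' implementation.
# Assumptions: +3 per correct / -1 per incorrect answer (a commonly cited
# Spanish residency-exam rule, not confirmed by the abstract), and
# reproducibility measured on per-question correct/incorrect outcomes
# from two runs of the same model.
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar


def official_score(n_correct: int, n_incorrect: int) -> int:
    """Raw exam score under the assumed +3/-1 rule (blanks score 0)."""
    return 3 * n_correct - n_incorrect


def reproducibility(run1, run2):
    """Agreement between two runs: Cohen's kappa and McNemar p-value.

    run1, run2: per-question booleans, True = answered correctly.
    """
    kappa = cohen_kappa_score(run1, run2)
    # 2x2 contingency table: rows = run 1 outcome, columns = run 2 outcome.
    both = sum(a and b for a, b in zip(run1, run2))
    only1 = sum(a and not b for a, b in zip(run1, run2))
    only2 = sum(b and not a for a, b in zip(run1, run2))
    neither = sum(not a and not b for a, b in zip(run1, run2))
    result = mcnemar([[both, only1], [only2, neither]], exact=True)
    return kappa, result.pvalue


# Example: 184 correct and 16 incorrect out of 200 questions.
print(official_score(184, 16))  # 536
```

A high kappa with a non-significant McNemar p-value indicates a model that repeats its initial answers; a significant McNemar result on retesting indicates a systematic shift, such as correcting earlier mistakes.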

Results: On the first attempt, GPT-o1 achieved the highest accuracy (92.0% on FIR, 97.0% on BIR), securing the top position in both exams. GPT-4o performed exceptionally well (87.0% on FIR, 97.5% on BIR), surpassing all human candidates on BIR and ranking third on FIR. GPT-4o Mini, while strong (80.5% on FIR, 93.0% on BIR), struggled with complex or image-reliant questions. The reproducibility analysis showed that GPT-o1 tended to correct previous mistakes on retesting, while GPT-4o and GPT-4o Mini more consistently repeated their initial answers.
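As a worked illustration, and assuming the +3/-1 rule sketched above with no questions left blank, GPT-o1's 92.0% accuracy on the 200-question FIR corresponds to 184 correct and 16 incorrect answers, i.e. a raw score of 184 × 3 - 16 = 536 points. Actual official scores depend on the real scoring rule, which the abstract does not detail.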

Conclusions: These models, particularly GPT-o1, outperformed human examinees, supporting AI integration in exam preparation and clinical training. However, limitations persist in multimodal understanding and specialized subdomains. Human oversight remains essential to ensure reliability in laboratory and clinical practice.

Source journal: Journal of Applied Laboratory Medicine (Medical Laboratory Technology). CiteScore 3.70; self-citation rate 5.00%; 137 articles per year.