Can AI Outperform Human Aspirants? Evaluating 3 ChatGPT Models on the Spanish FIR and BIR Specialized Health Examinations
Gonzalo Verdú, Alejandro Calvera Rayo, Aleix B Fabregat-Bolufer
Journal of Applied Laboratory Medicine, published 2025-09-03, pages 1215-1225. DOI: 10.1093/jalm/jfaf098 (https://doi.org/10.1093/jalm/jfaf098)
Abstract
Background: Artificial intelligence (AI) models are increasingly used in academic and clinical settings that require information synthesis and decision-making. This study explores the performance, accuracy, and reproducibility of 3 OpenAI models (GPT-4o Mini, GPT-4o, and GPT-o1) when applied to the 2023 Spanish FIR (Pharmaceutical Internal Resident) and BIR (Biologist Internal Resident) exams. By assessing their capabilities on these highly specialized tests, we aim to evaluate their potential as reliable tools for academic preparation and clinical support.
Methods: Each model was prompted with the 200 questions from each of the 2023 FIR and BIR exams. The analysis evaluated overall accuracy, official exam scoring, and predicted ranking. Subanalyses focused on multimodal image-based questions and clinical cases. Reproducibility was assessed by retesting all questions from both exams and comparing runs with Cohen's kappa and McNemar's test.
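For illustration, the following is a minimal Python sketch of how such a retest comparison could be computed, assuming per-question correctness is coded as 0/1 and using scikit-learn and statsmodels; the example data, the 0/1 coding, and the penalty-based scoring rule shown are assumptions for this sketch, not the authors' actual pipeline.

```python
# Minimal sketch (assumptions flagged below) of the kind of retest
# comparison described in Methods, with per-question correctness
# coded as 1 (correct) / 0 (incorrect).
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical correctness vectors for one model over the same questions.
first_attempt = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # run 1
retest        = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0]  # run 2

# Cohen's kappa: chance-corrected agreement between the two runs.
kappa = cohen_kappa_score(first_attempt, retest)

# McNemar's test: are correct->incorrect flips as common as
# incorrect->correct flips? Built from a 2x2 table of paired outcomes.
a = sum(f == 1 and r == 1 for f, r in zip(first_attempt, retest))
b = sum(f == 1 and r == 0 for f, r in zip(first_attempt, retest))
c = sum(f == 0 and r == 1 for f, r in zip(first_attempt, retest))
d = sum(f == 0 and r == 0 for f, r in zip(first_attempt, retest))
mc = mcnemar([[a, b], [c, d]], exact=True)  # exact test for small samples

# Assumed scoring rule for illustration only: each wrong answer deducts
# one-third of a correct answer (4-option format); the paper's
# "official exam scoring" may differ.
net_score = sum(retest) - (len(retest) - sum(retest)) / 3

print(f"Cohen's kappa: {kappa:.2f}")
print(f"McNemar p-value: {mc.pvalue:.3f}")
print(f"Illustrative net score: {net_score:.2f}")
```

On this framing, high kappa with a nonsignificant McNemar result indicates stable answers across runs, while an asymmetry toward incorrect-to-correct flips matches the retest improvement the Results attribute to GPT-o1.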
Results: After the first attempt, GPT-o1 achieved the highest accuracy (92.0% on FIR, 97.0% on BIR), securing top positions in both exams. GPT-4o performed exceptionally well (87.0% on FIR, 97.5% on BIR), surpassing all human candidates on BIR and ranking third on FIR. GPT-4o Mini, while strong (80.5% on FIR, 93.0% on BIR), struggled with complex or image-reliant questions. The reproducibility analysis showed that GPT-o1 tended to correct previous mistakes on retesting, whereas GPT-4o and GPT-4o Mini more consistently repeated their initial answers.
Conclusions: These models, particularly GPT-o1, outperformed human examinees, supporting AI integration in exam preparation and clinical training. However, limitations persist in multimodal understanding and specialized subdomains. Human oversight remains essential to ensure reliability in laboratory and clinical practice.