Gonzalo Verdú, Alejandro Calvera Rayo, Aleix B Fabregat-Bolufer
Journal of Applied Laboratory Medicine, pages 1215-1225, published 2025-09-03. DOI: 10.1093/jalm/jfaf098
Can AI Outperform Human Aspirants? Evaluating 3 ChatGPT Models on the Spanish FIR and BIR Specialized Health Examinations.
Background: Artificial intelligence (AI) models are increasingly used in academic and clinical settings that require information synthesis and decision-making. This study explores the performance, accuracy, and reproducibility of 3 OpenAI models (GPT-4o Mini, GPT-4o, and GPT-o1) when applied to the 2023 Spanish FIR (Pharmaceutical Internal Resident) and BIR (Biologist Internal Resident) exams. By assessing their capabilities on these highly specialized tests, we aim to evaluate their potential as reliable tools for academic preparation and clinical support.
Methods: Each model answered the 200 questions of the 2023 FIR exam and the 200 questions of the 2023 BIR exam. The analysis evaluated overall accuracy, official exam scoring, and predicted ranking. Subanalyses focused on multimodal image-based questions and clinical cases. Reproducibility was assessed by retesting all questions from both exams and comparing attempts with Cohen's kappa and McNemar's test.
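The test-retest comparison described above can be illustrated with a minimal sketch. This is not the authors' code; the helper names and the toy data are hypothetical. It treats each model's answers on two attempts as binary correct/incorrect vectors, then computes Cohen's kappa (chance-corrected agreement) and the McNemar chi-square statistic (asymmetry of the discordant pairs, i.e., answers that flipped between attempts):

```python
# Illustrative sketch (hypothetical, not the study's code): test-retest
# agreement for one model, where each list marks questions answered
# correctly (True) or incorrectly (False) on an attempt.

def cohens_kappa(first, second):
    """Cohen's kappa for two paired binary ratings."""
    n = len(first)
    po = sum(a == b for a, b in zip(first, second)) / n  # observed agreement
    # Chance agreement: both-correct plus both-incorrect under independence.
    p_yes = (sum(first) / n) * (sum(second) / n)
    p_no = ((n - sum(first)) / n) * ((n - sum(second)) / n)
    pe = p_yes + p_no
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)

def mcnemar_statistic(first, second):
    """McNemar chi-square computed on the discordant pairs only."""
    b = sum(a and not c for a, c in zip(first, second))      # correct -> wrong
    c = sum((not a) and c for a, c in zip(first, second))    # wrong -> correct
    return 0.0 if b + c == 0 else (b - c) ** 2 / (b + c)

# Toy retest data for a hypothetical 10-question exam.
attempt1 = [True, True, True, True, True, True, False, False, False, True]
attempt2 = [True, True, True, True, True, True, True, False, False, True]
print(cohens_kappa(attempt1, attempt2), mcnemar_statistic(attempt1, attempt2))
```

High kappa with a near-zero McNemar statistic corresponds to the behavior reported for GPT-4o and GPT-4o Mini (repeating initial answers), while a one-sided excess of wrong-to-correct flips, as reported for GPT-o1, inflates the McNemar statistic.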
Results: After the first attempt, GPT-o1 achieved the highest accuracy (92.0% on FIR, 97.0% on BIR), securing the top position in both exams. GPT-4o performed exceptionally (87.0% on FIR, 97.5% on BIR), surpassing all human candidates on BIR and ranking third on FIR. GPT-4o Mini, while strong (80.5% on FIR, 93.0% on BIR), struggled with complex or image-reliant questions. The reproducibility analysis showed GPT-o1's tendency to correct previous mistakes on retesting, whereas GPT-4o and GPT-4o Mini more consistently repeated their initial answers.
Conclusions: These models, particularly GPT-o1, outperformed human examinees, supporting AI integration in exam preparation and clinical training. However, limitations persist in multimodal understanding and specialized subdomains. Human oversight remains essential to ensure reliability in laboratory and clinical practice.