Gonzalo Verdú, Alejandro Calvera Rayo, Aleix B Fabregat-Bolufer
Journal of Applied Laboratory Medicine, pages 1215-1225, published 2025-09-03. DOI: 10.1093/jalm/jfaf098
Can AI Outperform Human Aspirants? Evaluating 3 ChatGPT Models on the Spanish FIR and BIR Specialized Health Examinations.
Background: Artificial intelligence (AI) models are increasingly used in academic and clinical settings that require information synthesis and decision-making. This study explores the performance, accuracy, and reproducibility of 3 OpenAI models (GPT-4o Mini, GPT-4o, and GPT-o1) when applied to the 2023 Spanish FIR (Pharmaceutical Internal Resident) and BIR (Biologist Internal Resident) exams. By assessing their capabilities on these highly specialized tests, we aim to evaluate their potential as reliable tools for academic preparation and clinical support.
Methods: Each model answered the 200 questions of the 2023 FIR exam and the 200 questions of the 2023 BIR exam. The analysis evaluated overall accuracy, official exam scoring, and predicted ranking. Subanalyses focused on multimodal image-based questions and clinical cases. Reproducibility was assessed by retesting all questions from both exams and comparing attempts with Cohen's kappa and McNemar's test.
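The test-retest comparison described above can be illustrated with a minimal sketch. This is not the authors' code; the helper names and the toy data are hypothetical. It treats each model's answers on two attempts as binary correct/incorrect vectors, then computes Cohen's kappa (chance-corrected agreement) and the McNemar chi-square statistic (asymmetry of the discordant pairs, i.e., answers that flipped between attempts):

```python
# Illustrative sketch (hypothetical, not the study's code): test-retest
# agreement for one model, where each list marks questions answered
# correctly (True) or incorrectly (False) on an attempt.

def cohens_kappa(first, second):
    """Cohen's kappa for two paired binary ratings."""
    n = len(first)
    po = sum(a == b for a, b in zip(first, second)) / n  # observed agreement
    # Chance agreement: both-correct plus both-incorrect under independence.
    p_yes = (sum(first) / n) * (sum(second) / n)
    p_no = ((n - sum(first)) / n) * ((n - sum(second)) / n)
    pe = p_yes + p_no
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)

def mcnemar_statistic(first, second):
    """McNemar chi-square computed on the discordant pairs only."""
    b = sum(a and not c for a, c in zip(first, second))      # correct -> wrong
    c = sum((not a) and c for a, c in zip(first, second))    # wrong -> correct
    return 0.0 if b + c == 0 else (b - c) ** 2 / (b + c)

# Toy retest data for a hypothetical 10-question exam.
attempt1 = [True, True, True, True, True, True, False, False, False, True]
attempt2 = [True, True, True, True, True, True, True, False, False, True]
print(cohens_kappa(attempt1, attempt2), mcnemar_statistic(attempt1, attempt2))
```

High kappa with a near-zero McNemar statistic corresponds to the behavior reported for GPT-4o and GPT-4o Mini (repeating initial answers), while a one-sided excess of wrong-to-correct flips, as reported for GPT-o1, inflates the McNemar statistic.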
Results: After the first attempt, GPT-o1 achieved the highest accuracy (92.0% on FIR, 97.0% on BIR), securing the top position in both exams. GPT-4o performed exceptionally (87.0% on FIR, 97.5% on BIR), surpassing all human candidates on BIR and ranking third on FIR. GPT-4o Mini, while strong (80.5% on FIR, 93.0% on BIR), struggled with complex or image-reliant questions. The reproducibility analysis showed GPT-o1's tendency to correct previous mistakes on retesting, whereas GPT-4o and GPT-4o Mini more consistently repeated their initial answers.
Conclusions: These models, particularly GPT-o1, outperformed human examinees, supporting AI integration in exam preparation and clinical training. However, limitations persist in multimodal understanding and specialized subdomains. Human oversight remains essential to ensure reliability in laboratory and clinical practice.