Evaluating Large Language Models on American Board of Anesthesiology-style Anesthesiology Questions: Accuracy, Domain Consistency, and Clinical Implications.
{"title":"Evaluating Large Language Models on American Board of Anesthesiology-style Anesthesiology Questions: Accuracy, Domain Consistency, and Clinical Implications.","authors":"Sagar Patel, Vinh Ngo, Brian Wilhelmi","doi":"10.1053/j.jvca.2025.05.033","DOIUrl":null,"url":null,"abstract":"<p><p>Recent advances in large language models (LLMs) have led to growing interest in their potential applications in medical education and clinical practice. This study evaluated whether five widely used and highly developed LLMs-ChatGPT-4, Gemini, Claude, Microsoft CoPilot, and Meta-could achieve a passing score on the American Board of Anesthesiology (ABA) BASIC Exam. Each model completed three separate sets of 200 multiple-choice questions derived from a widely used review resource, with the content distribution mirroring the ABA BASIC Exam blueprint. All five models demonstrated statistically significant performance above the 70% passing threshold (p < 0.05), with the following averages: ChatGPT-4: 92.0%, Gemini: 89.0%, Claude: 88.3%, Microsoft CoPilot: 91.5%, and Meta: 85.8%. Furthermore, an analysis of variance comparing their mean accuracy scores found no statistically significant difference among them (F = 1.88, p = 0.190). These findings suggest that current LLMs can surpass the minimum competency required for board certification, raising important questions about their future role in medical education and clinical decision-making. Performance on topics central to cardiac, thoracic, and vascular anesthesiology-such as hemodynamic management, cardiopulmonary physiology, and coagulation-was particularly robust, suggesting relevance to both fellowship-level education and complex intraoperative care. While these results highlight the capability of artificial intelligence (AI) to meet standardized medical knowledge benchmarks, their broader implications extend beyond examination performance. As AI continues to evolve, its integration into real-time patient care may transform anesthesiology practice, offering decision-support tools that assist physicians in synthesizing complex clinical data. Further research is needed to explore the reliability, ethical considerations, and real-world applications of AI-driven technologies in patient care settings.</p>","PeriodicalId":15176,"journal":{"name":"Journal of cardiothoracic and vascular anesthesia","volume":" ","pages":""},"PeriodicalIF":2.3000,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of cardiothoracic and vascular anesthesia","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1053/j.jvca.2025.05.033","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ANESTHESIOLOGY","Score":null,"Total":0}
Abstract
Recent advances in large language models (LLMs) have generated growing interest in their potential applications in medical education and clinical practice. This study evaluated whether five widely used, highly developed LLMs (ChatGPT-4, Gemini, Claude, Microsoft CoPilot, and Meta) could achieve a passing score on the American Board of Anesthesiology (ABA) BASIC Exam. Each model completed three separate sets of 200 multiple-choice questions drawn from a widely used review resource, with the content distribution mirroring the ABA BASIC Exam blueprint. All five models scored significantly above the 70% passing threshold (p < 0.05), with the following mean accuracies: ChatGPT-4, 92.0%; Gemini, 89.0%; Claude, 88.3%; Microsoft CoPilot, 91.5%; and Meta, 85.8%. An analysis of variance comparing the models' mean accuracy scores found no statistically significant difference among them (F = 1.88, p = 0.190). These findings suggest that current LLMs can surpass the minimum competency required for board certification, raising important questions about their future role in medical education and clinical decision-making. Performance on topics central to cardiac, thoracic, and vascular anesthesiology (such as hemodynamic management, cardiopulmonary physiology, and coagulation) was particularly robust, suggesting relevance to both fellowship-level education and complex intraoperative care. While these results highlight the capability of artificial intelligence (AI) to meet standardized medical knowledge benchmarks, their implications extend beyond examination performance. As AI continues to evolve, its integration into real-time patient care may transform anesthesiology practice, offering decision-support tools that help physicians synthesize complex clinical data. Further research is needed to explore the reliability, ethical considerations, and real-world applications of AI-driven technologies in patient care settings.
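To make the reported statistics concrete, the sketch below reproduces the two analyses described above using the standard tests they imply: a one-tailed binomial test of each model's pooled accuracy against the 70% passing threshold, and a one-way ANOVA across the models' run-level accuracies. The abstract reports only each model's mean over three 200-question sets, so the per-run values here are illustrative assumptions chosen to match those means; the computed F statistic will therefore approximate, not reproduce, the paper's F = 1.88.

```python
# Illustrative reanalysis of the reported results (Python, scipy).
# The per-run accuracies below are assumptions: only the three-run means
# (92.0%, 89.0%, 88.3%, 91.5%, 85.8%) appear in the abstract.
from scipy import stats

runs = {
    "ChatGPT-4":         [0.950, 0.890, 0.920],
    "Gemini":            [0.920, 0.860, 0.890],
    "Claude":            [0.915, 0.855, 0.880],
    "Microsoft CoPilot": [0.945, 0.885, 0.915],
    "Meta":              [0.890, 0.830, 0.855],
}

# One-tailed binomial test per model: pooled accuracy over 3 x 200 = 600
# questions versus the 70% passing threshold (abstract: p < 0.05 for all).
for model, accs in runs.items():
    correct = round(sum(a * 200 for a in accs))
    result = stats.binomtest(correct, n=600, p=0.70, alternative="greater")
    print(f"{model}: {correct}/600 correct, p = {result.pvalue:.2e}")

# One-way ANOVA on run-level accuracies across the five models
# (abstract: F = 1.88, p = 0.190, i.e. no significant difference).
f_stat, p_val = stats.f_oneway(*runs.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.3f}")
```

With five models and three runs each, the ANOVA has (4, 10) degrees of freedom, consistent with the study design implied by the abstract.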
About the Journal
The Journal of Cardiothoracic and Vascular Anesthesia is aimed primarily at anesthesiologists who care for patients undergoing cardiac, thoracic, or vascular surgical procedures. JCVA features a multidisciplinary approach, with contributions from cardiac, vascular, and thoracic surgeons, cardiologists, and other related specialists. Emphasis is placed on rapid publication of clinically relevant material.