{"title":"评估大型语言模型在美国委员会麻醉学风格的麻醉学问题:准确性,领域一致性,和临床意义。","authors":"Sagar Patel, Vinh Ngo, Brian Wilhelmi","doi":"10.1053/j.jvca.2025.05.033","DOIUrl":null,"url":null,"abstract":"<p><p>Recent advances in large language models (LLMs) have led to growing interest in their potential applications in medical education and clinical practice. This study evaluated whether five widely used and highly developed LLMs-ChatGPT-4, Gemini, Claude, Microsoft CoPilot, and Meta-could achieve a passing score on the American Board of Anesthesiology (ABA) BASIC Exam. Each model completed three separate sets of 200 multiple-choice questions derived from a widely used review resource, with the content distribution mirroring the ABA BASIC Exam blueprint. All five models demonstrated statistically significant performance above the 70% passing threshold (p < 0.05), with the following averages: ChatGPT-4: 92.0%, Gemini: 89.0%, Claude: 88.3%, Microsoft CoPilot: 91.5%, and Meta: 85.8%. Furthermore, an analysis of variance comparing their mean accuracy scores found no statistically significant difference among them (F = 1.88, p = 0.190). These findings suggest that current LLMs can surpass the minimum competency required for board certification, raising important questions about their future role in medical education and clinical decision-making. Performance on topics central to cardiac, thoracic, and vascular anesthesiology-such as hemodynamic management, cardiopulmonary physiology, and coagulation-was particularly robust, suggesting relevance to both fellowship-level education and complex intraoperative care. While these results highlight the capability of artificial intelligence (AI) to meet standardized medical knowledge benchmarks, their broader implications extend beyond examination performance. 
As AI continues to evolve, its integration into real-time patient care may transform anesthesiology practice, offering decision-support tools that assist physicians in synthesizing complex clinical data. Further research is needed to explore the reliability, ethical considerations, and real-world applications of AI-driven technologies in patient care settings.</p>","PeriodicalId":15176,"journal":{"name":"Journal of cardiothoracic and vascular anesthesia","volume":" ","pages":""},"PeriodicalIF":2.3000,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating Large Language Models on American Board of Anesthesiology-style Anesthesiology Questions: Accuracy, Domain Consistency, and Clinical Implications.\",\"authors\":\"Sagar Patel, Vinh Ngo, Brian Wilhelmi\",\"doi\":\"10.1053/j.jvca.2025.05.033\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Recent advances in large language models (LLMs) have led to growing interest in their potential applications in medical education and clinical practice. This study evaluated whether five widely used and highly developed LLMs-ChatGPT-4, Gemini, Claude, Microsoft CoPilot, and Meta-could achieve a passing score on the American Board of Anesthesiology (ABA) BASIC Exam. Each model completed three separate sets of 200 multiple-choice questions derived from a widely used review resource, with the content distribution mirroring the ABA BASIC Exam blueprint. All five models demonstrated statistically significant performance above the 70% passing threshold (p < 0.05), with the following averages: ChatGPT-4: 92.0%, Gemini: 89.0%, Claude: 88.3%, Microsoft CoPilot: 91.5%, and Meta: 85.8%. Furthermore, an analysis of variance comparing their mean accuracy scores found no statistically significant difference among them (F = 1.88, p = 0.190). 
These findings suggest that current LLMs can surpass the minimum competency required for board certification, raising important questions about their future role in medical education and clinical decision-making. Performance on topics central to cardiac, thoracic, and vascular anesthesiology-such as hemodynamic management, cardiopulmonary physiology, and coagulation-was particularly robust, suggesting relevance to both fellowship-level education and complex intraoperative care. While these results highlight the capability of artificial intelligence (AI) to meet standardized medical knowledge benchmarks, their broader implications extend beyond examination performance. As AI continues to evolve, its integration into real-time patient care may transform anesthesiology practice, offering decision-support tools that assist physicians in synthesizing complex clinical data. Further research is needed to explore the reliability, ethical considerations, and real-world applications of AI-driven technologies in patient care settings.</p>\",\"PeriodicalId\":15176,\"journal\":{\"name\":\"Journal of cardiothoracic and vascular anesthesia\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.3000,\"publicationDate\":\"2025-05-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of cardiothoracic and vascular anesthesia\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1053/j.jvca.2025.05.033\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ANESTHESIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of cardiothoracic and vascular 
anesthesia","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1053/j.jvca.2025.05.033","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ANESTHESIOLOGY","Score":null,"Total":0}
Evaluating Large Language Models on American Board of Anesthesiology-style Anesthesiology Questions: Accuracy, Domain Consistency, and Clinical Implications

Citations: 0

Abstract
Recent advances in large language models (LLMs) have led to growing interest in their potential applications in medical education and clinical practice. This study evaluated whether five widely used and highly developed LLMs (ChatGPT-4, Gemini, Claude, Microsoft CoPilot, and Meta) could achieve a passing score on the American Board of Anesthesiology (ABA) BASIC Exam. Each model completed three separate sets of 200 multiple-choice questions derived from a widely used review resource, with the content distribution mirroring the ABA BASIC Exam blueprint. All five models demonstrated statistically significant performance above the 70% passing threshold (p < 0.05), with the following averages: ChatGPT-4: 92.0%, Gemini: 89.0%, Claude: 88.3%, Microsoft CoPilot: 91.5%, and Meta: 85.8%. Furthermore, an analysis of variance comparing their mean accuracy scores found no statistically significant difference among them (F = 1.88, p = 0.190). These findings suggest that current LLMs can surpass the minimum competency required for board certification, raising important questions about their future role in medical education and clinical decision-making. Performance on topics central to cardiac, thoracic, and vascular anesthesiology (such as hemodynamic management, cardiopulmonary physiology, and coagulation) was particularly robust, suggesting relevance to both fellowship-level education and complex intraoperative care. While these results highlight the capability of artificial intelligence (AI) to meet standardized medical knowledge benchmarks, their broader implications extend beyond examination performance. As AI continues to evolve, its integration into real-time patient care may transform anesthesiology practice, offering decision-support tools that assist physicians in synthesizing complex clinical data. Further research is needed to explore the reliability, ethical considerations, and real-world applications of AI-driven technologies in patient care settings.
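The two statistical claims in the abstract (a one-way ANOVA across the models' mean accuracies, and each model's performance exceeding the 70% pass mark) can be sketched in code. The per-set scores below are hypothetical values chosen only to be consistent with the reported three-set means; they are not the study's raw data, and the exact binomial test against the pass threshold is an illustrative stand-in for whatever test the authors applied.

```python
import math

# Hypothetical per-set accuracies (%), three 200-question sets per model.
# Chosen only to be consistent with the reported means; NOT the study's raw data.
scores = {
    "ChatGPT-4":         [95.0, 89.0, 92.0],   # mean 92.0
    "Gemini":            [85.0, 93.0, 89.0],   # mean 89.0
    "Claude":            [84.0, 92.0, 89.0],   # mean ~88.3
    "Microsoft CoPilot": [96.0, 88.0, 90.5],   # mean 91.5
    "Meta":              [82.0, 88.0, 87.5],   # mean ~85.8
}

def one_way_anova_f(groups):
    """One-way ANOVA F statistic: between-group vs. within-group variance."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def binom_sf(k, n, p):
    """Exact one-sided P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

f_stat = one_way_anova_f(list(scores.values()))
# Could ChatGPT-4's 92% (552 of 600 questions) arise by chance at the 70% pass mark?
p_pass = binom_sf(552, 600, 0.70)
print(f"F = {f_stat:.2f}; P(>=552/600 correct | p=0.70) = {p_pass:.1e}")
```

With these illustrative numbers the F statistic falls below the 5% critical value of F(4, 10), matching the direction of the reported non-significant result, while the pass-threshold p-value is vanishingly small, as the abstract's p < 0.05 finding would suggest for every model.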
Journal Introduction:
The Journal of Cardiothoracic and Vascular Anesthesia is primarily aimed at anesthesiologists who deal with patients undergoing cardiac, thoracic or vascular surgical procedures. JCVA features a multidisciplinary approach, with contributions from cardiac, vascular and thoracic surgeons, cardiologists, and other related specialists. Emphasis is placed on rapid publication of clinically relevant material.