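Evaluating Large Language Models on American Board of Anesthesiology-style Anesthesiology Questions: Accuracy, Domain Consistency, and Clinical Implications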

Impact factor 2.3; Zone 4 (Medicine); JCR Q2 (Anesthesiology)
Sagar Patel, Vinh Ngo, Brian Wilhelmi
Journal: Journal of Cardiothoracic and Vascular Anesthesia
Publication date: 2025-05-21
Publication type: Journal Article
DOI: 10.1053/j.jvca.2025.05.033 (https://doi.org/10.1053/j.jvca.2025.05.033)
Citations: 0

Abstract

Recent advances in large language models (LLMs) have led to growing interest in their potential applications in medical education and clinical practice. This study evaluated whether five widely used and highly developed LLMs (ChatGPT-4, Gemini, Claude, Microsoft CoPilot, and Meta) could achieve a passing score on the American Board of Anesthesiology (ABA) BASIC Exam. Each model completed three separate sets of 200 multiple-choice questions derived from a widely used review resource, with the content distribution mirroring the ABA BASIC Exam blueprint. All five models demonstrated statistically significant performance above the 70% passing threshold (p < 0.05), with the following averages: ChatGPT-4: 92.0%, Gemini: 89.0%, Claude: 88.3%, Microsoft CoPilot: 91.5%, and Meta: 85.8%. Furthermore, an analysis of variance comparing their mean accuracy scores found no statistically significant difference among them (F = 1.88, p = 0.190). These findings suggest that current LLMs can surpass the minimum competency required for board certification, raising important questions about their future role in medical education and clinical decision-making. Performance on topics central to cardiac, thoracic, and vascular anesthesiology, such as hemodynamic management, cardiopulmonary physiology, and coagulation, was particularly robust, suggesting relevance to both fellowship-level education and complex intraoperative care. While these results highlight the capability of artificial intelligence (AI) to meet standardized medical knowledge benchmarks, their broader implications extend beyond examination performance. As AI continues to evolve, its integration into real-time patient care may transform anesthesiology practice, offering decision-support tools that assist physicians in synthesizing complex clinical data. Further research is needed to explore the reliability, ethical considerations, and real-world applications of AI-driven technologies in patient care settings.
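
As a rough consistency check on the reported ANOVA, the p-value can be recomputed from the F statistic. The sketch below assumes a one-way ANOVA over 15 set-level accuracy scores (five models, three sets each), which gives 4 and 10 degrees of freedom; the abstract does not state the degrees of freedom, so that design is an assumption. The helper name `f_survival_even_df` is hypothetical, and it relies on a closed form that only applies when both degrees of freedom are even.

```python
from math import comb


def f_survival_even_df(f_stat: float, df_between: int, df_within: int) -> float:
    """Return P(F >= f_stat) for an F distribution with even degrees of freedom."""
    if df_between % 2 or df_within % 2:
        raise ValueError("this closed form needs even degrees of freedom")
    a = df_within // 2
    b = df_between // 2
    # Standard identity: P(F >= f) = I_x(df_within / 2, df_between / 2)
    x = df_within / (df_within + df_between * f_stat)
    n = a + b - 1
    # For integer a and b, the regularized incomplete beta I_x(a, b)
    # equals the upper tail P(Binomial(n, x) >= a).
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


# Assumed design (not stated in the abstract): 5 models, 3 question sets each.
n_models = 5
n_sets = 3
df_between = n_models - 1            # between-group degrees of freedom
df_within = n_models * (n_sets - 1)  # within-group degrees of freedom

p_value = f_survival_even_df(1.88, df_between, df_within)
print(f"F({df_between}, {df_within}) = 1.88 -> p = {p_value:.3f}")
```

Under these assumptions the recomputed p-value comes out at roughly 0.19, in line with the reported p = 0.190, which suggests the comparison was run on the 15 set-level scores rather than on individual questions.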

Source journal
CiteScore: 4.80
Self-citation rate: 17.90%
Articles published: 606
Review time: 37 days
About the journal: The Journal of Cardiothoracic and Vascular Anesthesia is primarily aimed at anesthesiologists who deal with patients undergoing cardiac, thoracic or vascular surgical procedures. JCVA features a multidisciplinary approach, with contributions from cardiac, vascular and thoracic surgeons, cardiologists, and other related specialists. Emphasis is placed on rapid publication of clinically relevant material.