How Soon Will Surgeons Become Mere Technicians? Chatbot Performance in Managing Clinical Scenarios.

IF 4.9 · CAS Tier 1 (Medicine) · JCR Q1 (Cardiac & Cardiovascular Systems)
Darren S Bryan, Joseph J Platz, Keith S Naunheim, Mark K Ferguson
{"title":"How Soon Will Surgeons Become Mere Technicians? Chatbot Performance in Managing Clinical Scenarios.","authors":"Darren S Bryan, Joseph J Platz, Keith S Naunheim, Mark K Ferguson","doi":"10.1016/j.jtcvs.2024.11.006","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>Chatbot use has developed a presence in medicine and surgery and has been proposed to help guide clinical decision making. However, the accuracy of information provided by artificial intelligence (AI) platforms has been questioned. We evaluated the performance of 4 popular chatbots on a board-style examination and compared results with a group of board-certified thoracic surgeons.</p><p><strong>Methods: </strong>Clinical scenarios were developed within domains based on the ABTS Qualifying Exam. Each scenario included three stems written with the Key Feature methodology related to diagnosis, evaluation, and treatment. Ten scenarios were presented to ChatGPT-4, Bard (now Gemini), Perplexity, and Claude 2, as well as randomly selected ABTS-certified surgeons. The maximum possible score was 3 points per scenario. Critical failures were identified during exam development; if they occurred in any of the 3 stems the entire question received a score of 0. The Mann-Whitney U test was used to compare surgeon and chatbot scores.</p><p><strong>Results: </strong>Examinations were completed by 21 surgeons, the majority of whom (14; 66%) practiced in academic or university settings. The median score per scenario for chatbots was 1.06compared to 1.88 for surgeons (difference 0.66, p=0.019). Surgeon median scores were better than chatbot median scores for all except two scenarios. Chatbot answers were significantly more likely to be deemed critical failures compared to those provided by surgeons (median 0.50 per chatbot/scenario vs. 0.19 per surgeon/scenario; p=0.016).</p><p><strong>Conclusions: </strong>Four popular chatbots performed at a significantly lower level than board-certified surgeons. Implementation of AI should be undertaken with caution in clinical decision making.</p>","PeriodicalId":49975,"journal":{"name":"Journal of Thoracic and Cardiovascular Surgery","volume":" ","pages":""},"PeriodicalIF":4.9000,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Thoracic and Cardiovascular Surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.jtcvs.2024.11.006","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CARDIAC & CARDIOVASCULAR SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Objective: Chatbots have established a presence in medicine and surgery and have been proposed as aids to clinical decision making. However, the accuracy of information provided by artificial intelligence (AI) platforms has been questioned. We evaluated the performance of 4 popular chatbots on a board-style examination and compared their results with those of a group of board-certified thoracic surgeons.

Methods: Clinical scenarios were developed within domains based on the ABTS Qualifying Exam. Each scenario included three stems, written using the Key Feature methodology, related to diagnosis, evaluation, and treatment. Ten scenarios were presented to ChatGPT-4, Bard (now Gemini), Perplexity, and Claude 2, as well as to randomly selected ABTS-certified surgeons. The maximum possible score was 3 points per scenario. Critical failures were identified during exam development; if one occurred in any of the three stems, the entire question received a score of 0. The Mann-Whitney U test was used to compare surgeon and chatbot scores.
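The scoring rule described above reduces to a small function: sum the three stem points unless any stem triggers a critical failure, in which case the whole scenario scores 0; per-scenario scores for the two groups are then compared with a Mann-Whitney U test. A minimal Python sketch of that logic, using illustrative data and a hypothetical helper name rather than the study's actual scores or grading rubric:

```python
# Hypothetical sketch of the scoring rule from Methods: 3 stems per scenario
# (diagnosis, evaluation, treatment) worth 1 point each, with any critical
# failure zeroing the entire scenario. All data below are illustrative.
from scipy.stats import mannwhitneyu


def score_scenario(stem_points, critical_failure_flags):
    """Return the scenario score (0-3); any critical failure scores 0."""
    if any(critical_failure_flags):
        return 0
    return sum(stem_points)  # one point per correct stem, max 3


# Illustrative per-scenario scores for each group (not the study's raw data).
surgeon_scores = [score_scenario(s, f) for s, f in [
    ([1, 1, 0], [False, False, False]),  # 2 points
    ([1, 1, 1], [False, False, False]),  # 3 points
    ([1, 0, 1], [False, True, False]),   # critical failure -> 0
]]
chatbot_scores = [score_scenario(s, f) for s, f in [
    ([1, 0, 0], [False, False, False]),  # 1 point
    ([1, 1, 0], [False, False, True]),   # critical failure -> 0
    ([0, 1, 1], [False, False, False]),  # 2 points
]]

# Mann-Whitney U test comparing the two score distributions, mirroring the
# comparison reported in Results.
stat, p_value = mannwhitneyu(surgeon_scores, chatbot_scores, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")
```

The Mann-Whitney U test suits this design because per-scenario scores are bounded ordinal values (0 to 3) rather than normally distributed measurements.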

Results: Examinations were completed by 21 surgeons, the majority of whom (14; 66%) practiced in academic or university settings. The median score per scenario for chatbots was 1.06, compared with 1.88 for surgeons (difference 0.66, p=0.019). Surgeon median scores were better than chatbot median scores for all but two scenarios. Chatbot answers were significantly more likely to be deemed critical failures than surgeon answers (median 0.50 per chatbot per scenario vs. 0.19 per surgeon per scenario; p=0.016).

Conclusions: Four popular chatbots performed at a significantly lower level than board-certified surgeons. AI should be implemented in clinical decision making with caution.

Journal metrics
CiteScore: 11.20
Self-citation rate: 10.00%
Publication volume: 1,079
Time to review: 68 days
Journal description: The Journal of Thoracic and Cardiovascular Surgery presents original, peer-reviewed articles on diseases of the heart, great vessels, lungs and thorax with emphasis on surgical interventions. An official publication of The American Association for Thoracic Surgery and The Western Thoracic Surgical Association, the Journal focuses on techniques and developments in acquired cardiac surgery, congenital cardiac repair, thoracic procedures, heart and lung transplantation, mechanical circulatory support and other procedures.