Darren S. Bryan MD , Joseph J. Platz MD , Keith S. Naunheim MD , Mark K. Ferguson MD
Journal of Thoracic and Cardiovascular Surgery, Volume 170, Issue 4, Pages 1179-1184 (October 1, 2025). DOI: 10.1016/j.jtcvs.2024.11.006
How soon will surgeons become mere technicians? Chatbot performance in managing clinical scenarios
Background
Chatbots have established a presence in medicine and surgery and have been proposed as aids to clinical decision making. However, the accuracy of information provided by artificial intelligence (AI) platforms has been called into question. We evaluated the performance of 4 popular chatbots on a board-style examination and compared their results with those of a group of board-certified thoracic surgeons.
Methods
Clinical scenarios were developed within domains based on the American Board of Thoracic Surgery (ABTS) Qualifying Exam. Each scenario included 3 stems, written with the Key Feature methodology, relating to diagnosis, evaluation, and treatment. Ten scenarios were presented to ChatGPT-4, Bard (now Gemini), Perplexity, and Claude 2, as well as to randomly selected ABTS-certified surgeons. The maximum possible score was 3 points per scenario. Critical failures were identified during exam development; if one occurred in any of the 3 stems, the entire question received a score of 0. The Mann-Whitney U test was used to compare surgeon scores and chatbot scores.
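The scoring scheme above can be sketched in code: each scenario has 3 stems worth 1 point each, a critical failure in any stem zeroes the whole scenario, and group scores are compared with the Mann-Whitney U test. This is a minimal illustration, not the study's actual instrument; all scores below are hypothetical placeholders, and the U statistic is computed in its pair-count form (the study's p-values would come from a statistics package such as scipy.stats.mannwhitneyu).

```python
import statistics

def score_scenario(stem_points, critical_failure_flags):
    """Score one scenario: one point per correct stem (max 3),
    zeroed entirely if any stem triggers a critical failure."""
    if any(critical_failure_flags):
        return 0
    return sum(stem_points)

def mann_whitney_u(sample_a, sample_b):
    """Mann-Whitney U statistic for sample_a vs sample_b:
    count pairs where a > b, with ties contributing 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical per-scenario scores for illustration only (not study data).
surgeon_scores = [score_scenario([1, 1, 0], [False, False, False]),  # 2 points
                  score_scenario([1, 1, 1], [False, False, False])]  # 3 points
chatbot_scores = [score_scenario([1, 0, 0], [False, False, False]),  # 1 point
                  score_scenario([1, 1, 1], [False, True, False])]   # 0: critical failure

print(statistics.median(surgeon_scores))              # median surgeon score
print(mann_whitney_u(surgeon_scores, chatbot_scores)) # U statistic
```

Note how the critical-failure rule makes the metric non-additive: a chatbot that answers all 3 stems correctly but commits a critical failure still scores 0 for that scenario, which is why critical-failure counts are reported separately in the Results.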
Results
Examinations were completed by 21 surgeons, the majority of whom (n = 14; 66%) practiced in academic or university settings. The median score per scenario was 1.06 for chatbots, compared to 1.88 for surgeons (difference, 0.66; P = .019). Surgeon median scores were better than chatbot median scores for all except 2 scenarios. Chatbot answers were significantly more likely to be deemed critical failures compared to those provided by surgeons (median, 0.50 per chatbot/scenario vs 0.19 per surgeon/scenario; P = .016).
Conclusions
Four popular chatbots performed at a significantly lower level than board-certified surgeons. Implementation of AI should be undertaken with caution in clinical decision making.
Journal Introduction
The Journal of Thoracic and Cardiovascular Surgery presents original, peer-reviewed articles on diseases of the heart, great vessels, lungs and thorax with emphasis on surgical interventions. An official publication of The American Association for Thoracic Surgery and The Western Thoracic Surgical Association, the Journal focuses on techniques and developments in acquired cardiac surgery, congenital cardiac repair, thoracic procedures, heart and lung transplantation, mechanical circulatory support and other procedures.