Darren S. Bryan MD , Joseph J. Platz MD , Keith S. Naunheim MD , Mark K. Ferguson MD
Journal of Thoracic and Cardiovascular Surgery, Volume 170, Issue 4, Pages 1179-1184 (October 1, 2025). DOI: 10.1016/j.jtcvs.2024.11.006
How soon will surgeons become mere technicians? Chatbot performance in managing clinical scenarios
Background
Chatbots have established a presence in medicine and surgery and have been proposed as aids to clinical decision making. However, the accuracy of information provided by artificial intelligence (AI) platforms has been called into question. We evaluated the performance of 4 popular chatbots on a board-style examination and compared their results with those of a group of board-certified thoracic surgeons.
Methods
Clinical scenarios were developed within domains based on the American Board of Thoracic Surgery (ABTS) Qualifying Exam. Each scenario included 3 stems, written with the Key Feature methodology, relating to diagnosis, evaluation, and treatment. Ten scenarios were presented to ChatGPT-4, Bard (now Gemini), Perplexity, and Claude 2, as well as to randomly selected ABTS-certified surgeons. The maximum possible score was 3 points per scenario. Critical failures were identified during exam development; if one occurred in any of the 3 stems, the entire question received a score of 0. The Mann-Whitney U test was used to compare surgeon scores and chatbot scores.
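The scoring scheme above can be sketched in code: each scenario has 3 stems worth 1 point each, a critical failure in any stem zeroes the whole scenario, and group scores are compared with the Mann-Whitney U test. This is a minimal illustration, not the study's actual instrument; all scores below are hypothetical placeholders, and the U statistic is computed in its pair-count form (the study's p-values would come from a statistics package such as scipy.stats.mannwhitneyu).

```python
import statistics

def score_scenario(stem_points, critical_failure_flags):
    """Score one scenario: one point per correct stem (max 3),
    zeroed entirely if any stem triggers a critical failure."""
    if any(critical_failure_flags):
        return 0
    return sum(stem_points)

def mann_whitney_u(sample_a, sample_b):
    """Mann-Whitney U statistic for sample_a vs sample_b:
    count pairs where a > b, with ties contributing 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical per-scenario scores for illustration only (not study data).
surgeon_scores = [score_scenario([1, 1, 0], [False, False, False]),  # 2 points
                  score_scenario([1, 1, 1], [False, False, False])]  # 3 points
chatbot_scores = [score_scenario([1, 0, 0], [False, False, False]),  # 1 point
                  score_scenario([1, 1, 1], [False, True, False])]   # 0: critical failure

print(statistics.median(surgeon_scores))              # median surgeon score
print(mann_whitney_u(surgeon_scores, chatbot_scores)) # U statistic
```

Note how the critical-failure rule makes the metric non-additive: a chatbot that answers all 3 stems correctly but commits a critical failure still scores 0 for that scenario, which is why critical-failure counts are reported separately in the Results.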
Results
Examinations were completed by 21 surgeons, the majority of whom (n = 14; 66%) practiced in academic or university settings. The median score per scenario was 1.06 for chatbots, compared to 1.88 for surgeons (difference, 0.66; P = .019). Surgeon median scores were better than chatbot median scores for all except 2 scenarios. Chatbot answers were significantly more likely to be deemed critical failures compared to those provided by surgeons (median, 0.50 per chatbot/scenario vs 0.19 per surgeon/scenario; P = .016).
Conclusions
Four popular chatbots performed at a significantly lower level than board-certified surgeons. Implementation of AI should be undertaken with caution in clinical decision making.
Journal Introduction
The Journal of Thoracic and Cardiovascular Surgery presents original, peer-reviewed articles on diseases of the heart, great vessels, lungs and thorax with emphasis on surgical interventions. An official publication of The American Association for Thoracic Surgery and The Western Thoracic Surgical Association, the Journal focuses on techniques and developments in acquired cardiac surgery, congenital cardiac repair, thoracic procedures, heart and lung transplantation, mechanical circulatory support and other procedures.