Darren S Bryan, Joseph J Platz, Keith S Naunheim, Mark K Ferguson
{"title":"How Soon Will Surgeons Become Mere Technicians? Chatbot Performance in Managing Clinical Scenarios.","authors":"Darren S Bryan, Joseph J Platz, Keith S Naunheim, Mark K Ferguson","doi":"10.1016/j.jtcvs.2024.11.006","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>Chatbot use has developed a presence in medicine and surgery and has been proposed to help guide clinical decision making. However, the accuracy of information provided by artificial intelligence (AI) platforms has been questioned. We evaluated the performance of 4 popular chatbots on a board-style examination and compared results with a group of board-certified thoracic surgeons.</p><p><strong>Methods: </strong>Clinical scenarios were developed within domains based on the ABTS Qualifying Exam. Each scenario included three stems written with the Key Feature methodology related to diagnosis, evaluation, and treatment. Ten scenarios were presented to ChatGPT-4, Bard (now Gemini), Perplexity, and Claude 2, as well as randomly selected ABTS-certified surgeons. The maximum possible score was 3 points per scenario. Critical failures were identified during exam development; if they occurred in any of the 3 stems the entire question received a score of 0. The Mann-Whitney U test was used to compare surgeon and chatbot scores.</p><p><strong>Results: </strong>Examinations were completed by 21 surgeons, the majority of whom (14; 66%) practiced in academic or university settings. The median score per scenario for chatbots was 1.06compared to 1.88 for surgeons (difference 0.66, p=0.019). Surgeon median scores were better than chatbot median scores for all except two scenarios. Chatbot answers were significantly more likely to be deemed critical failures compared to those provided by surgeons (median 0.50 per chatbot/scenario vs. 0.19 per surgeon/scenario; p=0.016).</p><p><strong>Conclusions: </strong>Four popular chatbots performed at a significantly lower level than board-certified surgeons. Implementation of AI should be undertaken with caution in clinical decision making.</p>","PeriodicalId":49975,"journal":{"name":"Journal of Thoracic and Cardiovascular Surgery","volume":" ","pages":""},"PeriodicalIF":4.9000,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Thoracic and Cardiovascular Surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.jtcvs.2024.11.006","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CARDIAC & CARDIOVASCULAR SYSTEMS","Score":null,"Total":0}
Abstract
Objective: Chatbots have established a presence in medicine and surgery and have been proposed as aids to clinical decision making. However, the accuracy of information provided by artificial intelligence (AI) platforms has been questioned. We evaluated the performance of four popular chatbots on a board-style examination and compared their results with those of a group of board-certified thoracic surgeons.
Methods: Clinical scenarios were developed within domains based on the ABTS Qualifying Exam. Each scenario included three stems, written with the Key Feature methodology, addressing diagnosis, evaluation, and treatment. Ten scenarios were presented to ChatGPT-4, Bard (now Gemini), Perplexity, and Claude 2, as well as to randomly selected ABTS-certified surgeons. The maximum possible score was 3 points per scenario. Critical failures were identified during exam development; if one occurred in any of the three stems, the entire scenario received a score of 0. The Mann-Whitney U test was used to compare surgeon and chatbot scores, as sketched below.
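The following is a minimal sketch, not the authors' analysis code, illustrating the scoring rule and statistical comparison described in the Methods: each scenario comprises three stems worth 1 point each, any critical failure zeroes the scenario, and per-scenario scores for the two groups are compared with the Mann-Whitney U test (via scipy.stats.mannwhitneyu). All scores shown are hypothetical placeholders, not study data.

```python
from scipy.stats import mannwhitneyu

def scenario_score(stem_points, critical_failure):
    """Score one scenario: sum of the three stem points (max 3),
    or 0 if a critical failure occurred in any stem."""
    return 0 if critical_failure else sum(stem_points)

# Hypothetical per-scenario scores for illustration only.
surgeon_scores = [
    scenario_score([1, 1, 1], critical_failure=False),  # 3
    scenario_score([1, 1, 0], critical_failure=False),  # 2
    scenario_score([1, 0, 1], critical_failure=False),  # 2
]
chatbot_scores = [
    scenario_score([1, 1, 0], critical_failure=True),   # 0 (critical failure)
    scenario_score([1, 0, 0], critical_failure=False),  # 1
    scenario_score([1, 1, 0], critical_failure=False),  # 2
]

# Nonparametric comparison of the two groups' per-scenario scores.
stat, p_value = mannwhitneyu(surgeon_scores, chatbot_scores, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.2f}, p = {p_value:.3f}")
```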
Results: Examinations were completed by 21 surgeons, the majority of whom (14; 66%) practiced in academic or university settings. The median score per scenario for chatbots was 1.06, compared with 1.88 for surgeons (difference 0.66, p=0.019). Surgeon median scores were better than chatbot median scores for all but two scenarios. Chatbot answers were significantly more likely than surgeon answers to be deemed critical failures (median 0.50 per chatbot/scenario vs. 0.19 per surgeon/scenario; p=0.016).
Conclusions: Four popular chatbots performed at a significantly lower level than board-certified surgeons. AI should be implemented in clinical decision making with caution.
About the Journal
The Journal of Thoracic and Cardiovascular Surgery presents original, peer-reviewed articles on diseases of the heart, great vessels, lungs and thorax with emphasis on surgical interventions. An official publication of The American Association for Thoracic Surgery and The Western Thoracic Surgical Association, the Journal focuses on techniques and developments in acquired cardiac surgery, congenital cardiac repair, thoracic procedures, heart and lung transplantation, mechanical circulatory support and other procedures.