The performance of artificial intelligence large language model-linked chatbots in surgical decision-making for gastroesophageal reflux disease

Bright Huo, Elisa Calabrese, Patricia Sylla, Sunjay Kumar, Romeo C. Ignacio, Rodolfo Oviedo, Imran Hassan, Bethany J. Slater, Andreas Kaiser, Danielle S. Walsh, Wesley Vosburg
Surgical Endoscopy, published online April 17, 2024. DOI: 10.1007/s00464-024-10807-w

Abstract

Background

Large language model (LLM)-linked chatbots may be an efficient source of clinical recommendations for healthcare providers and patients. This study evaluated the performance of LLM-linked chatbots in providing recommendations for the surgical management of gastroesophageal reflux disease (GERD).

Methods

Nine patient cases were created based on key questions (KQs) addressed by the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) guidelines for the surgical treatment of GERD. ChatGPT-3.5, ChatGPT-4, Copilot, Google Bard, and Perplexity AI were queried on November 16, 2023, for recommendations regarding the surgical management of GERD. Chatbot accuracy was defined as the number of responses aligning with SAGES guideline recommendations. Outcomes were reported as counts and percentages.

Results

Surgeons were given accurate recommendations for the surgical management of GERD in an adult patient for 5/7 (71.4%) KQs by ChatGPT-4, 3/7 (42.9%) KQs by Copilot, 6/7 (85.7%) KQs by Google Bard, and 3/7 (42.9%) KQs by Perplexity according to the SAGES guidelines. Patients were given accurate recommendations for 3/5 (60.0%) KQs by ChatGPT-4, 2/5 (40.0%) KQs by Copilot, 4/5 (80.0%) KQs by Google Bard, and 1/5 (20.0%) KQs by Perplexity. In a pediatric patient, surgeons were given accurate recommendations for 2/3 (66.7%) KQs by ChatGPT-4, 3/3 (100.0%) KQs by Copilot, 3/3 (100.0%) KQs by Google Bard, and 2/3 (66.7%) KQs by Perplexity. Patients were given appropriate guidance for 2/2 (100.0%) KQs by ChatGPT-4, 2/2 (100.0%) KQs by Copilot, 1/2 (50.0%) KQs by Google Bard, and 1/2 (50.0%) KQs by Perplexity.
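The accuracy figures above are simple proportions of guideline-concordant responses per chatbot. A minimal sketch of that tally (not the authors' code; the dictionary names are hypothetical, and the counts are taken from the adult surgeon-directed results reported here):

```python
# Illustrative sketch: accuracy = guideline-concordant responses / total KQs.
# Counts below are the adult, surgeon-directed results from this abstract.
adult_surgeon_scores = {
    "ChatGPT-4": (5, 7),
    "Copilot": (3, 7),
    "Google Bard": (6, 7),
    "Perplexity": (3, 7),
}

def accuracy_pct(concordant: int, total: int) -> float:
    """Percentage of key questions answered in line with SAGES guidance."""
    return round(100 * concordant / total, 1)

for bot, (concordant, total) in adult_surgeon_scores.items():
    print(f"{bot}: {concordant}/{total} KQs ({accuracy_pct(concordant, total)}%)")
```

The same calculation reproduces the patient-directed and pediatric percentages when the corresponding counts are substituted.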

Conclusions

Gastrointestinal surgeons, gastroenterologists, and patients should recognize both the promise and the pitfalls of LLMs when used for advice on the surgical management of GERD. Additional training of LLMs on evidence-based health information is needed.
