Using Large Language Models in the Diagnosis of Acute Cholecystitis: Assessing Accuracy and Guidelines Compliance.

IF 1.0 | Zone 4 (Medicine) | JCR Q3 (Surgery)

American Surgeon Pub Date : 2025-06-01 Epub Date: 2025-03-12 DOI:10.1177/00031348251323719

Marta Goglia, Arianna Cicolani, Francesco Maria Carrano, Niccolò Petrucciani, Francesco D'Angelo, Marco Pace, Lucio Chiarini, Gianfranco Silecchia, Paolo Aurello

American Surgeon, pp. 967-977. Publication type: Journal Article. Full text: https://doi.org/10.1177/00031348251323719

Citations: 0

Abstract

Background: Large language models (LLMs) are advanced tools capable of understanding and generating human-like text. This study evaluated the accuracy of several commercial LLMs in addressing clinical questions related to diagnosis and management of acute cholecystitis, as outlined in the Tokyo Guidelines 2018 (TG18). We assessed their congruence with the expert panel discussions presented in the guidelines.

Methods: We evaluated ChatGPT4.0, Gemini Advanced, and GPTo1-preview on ten clinical questions. Eight derived from TG18, and two were formulated by the authors. Two authors independently rated the accuracy of each LLM's responses on a four-point scale: (1) accurate and comprehensive, (2) accurate but not comprehensive, (3) partially accurate, partially inaccurate, and (4) entirely inaccurate. A third author resolved any scoring discrepancies. Then, we comparatively analyzed the performance of ChatGPT4.0 against newer large language models (LLMs), specifically Gemini Advanced and GPTo1-preview, on the same set of questions to delineate their respective strengths and limitations.

Results: ChatGPT4.0 provided consistent responses for 90% of the questions. It delivered "accurate and comprehensive" answers for 4/10 (40%) questions and "accurate but not comprehensive" answers for 5/10 (50%). One response (10%) was rated as "partially accurate, partially inaccurate." Gemini Advanced demonstrated higher accuracy on some questions but yielded a similar percentage of "partially accurate, partially inaccurate" responses. Notably, neither model produced "entirely inaccurate" answers.

Discussion: LLMs, such as ChatGPT and Gemini Advanced, demonstrate potential in accurately addressing clinical questions regarding acute cholecystitis. With awareness of their limitations, their careful implementation, and ongoing refinement, LLMs could serve as valuable resources for physician education and patient information, potentially improving clinical decision-making in the future.
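The scoring procedure in the Methods (two independent raters, a four-point scale, a third author resolving disagreements) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `resolve` and `distribution` and all per-question ratings below are hypothetical placeholders; the ratings were chosen only so that the final tally matches the 4/10, 5/10, 1/10 split the abstract reports for ChatGPT4.0.

```python
from collections import Counter

# Four-point scale as worded in the abstract.
SCALE = {
    1: "accurate and comprehensive",
    2: "accurate but not comprehensive",
    3: "partially accurate, partially inaccurate",
    4: "entirely inaccurate",
}


def resolve(rater_a, rater_b, adjudicator):
    """Final score per question: the shared score when the two raters
    agree, otherwise the third author's score (assumed tie-break rule)."""
    if not (len(rater_a) == len(rater_b) == len(adjudicator)):
        raise ValueError("all rating lists must cover the same questions")
    return [a if a == b else c for a, b, c in zip(rater_a, rater_b, adjudicator)]


def distribution(scores):
    """Map each scale label to (count, percentage of all questions)."""
    counts = Counter(scores)
    n = len(scores)
    return {
        label: (counts.get(score, 0), 100.0 * counts.get(score, 0) / n)
        for score, label in SCALE.items()
    }


# Hypothetical ratings for ten questions (NOT data from the study).
rater_a = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3]
rater_b = [1, 1, 1, 2, 2, 2, 2, 2, 2, 3]  # disagrees on the fourth question
third = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3]    # only consulted on disagreements

final = resolve(rater_a, rater_b, third)
summary = distribution(final)

for label, (count, pct) in summary.items():
    print(f"{label}: {count}/{len(final)} ({pct:.0f}%)")
```

With only ten questions per model, each question moves a category by ten percentage points, so the reported percentages are best read as counts.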

Source journal: American Surgeon (Medicine: Surgery)

CiteScore: 1.40

Self-citation rate: 0.00%

Publication count: 623

Journal description: The American Surgeon is a monthly peer-reviewed publication published by the Southeastern Surgical Congress. Its area of concentration is clinical general surgery, as defined by the content areas of the American Board of Surgery: alimentary tract (including bariatric surgery), abdomen and its contents, breast, skin and soft tissue, endocrine system, solid organ transplantation, pediatric surgery, surgical critical care, surgical oncology (including head and neck surgery), trauma and emergency surgery, and vascular surgery.