Evaluating large language models for renal colic imaging recommendations: a comparative analysis of Gemini, Copilot, and ChatGPT-4.0.

IF 2.0 · Q2 · Emergency Medicine
Yavuz Yigit, Asım Enes Ozbek, Betul Dogru, Serkan Gunay, Baha AlKahlout
International Journal of Emergency Medicine, vol. 18, no. 1, p. 123. Published 2025-07-04. DOI: 10.1186/s12245-025-00895-3. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12232162/pdf/
Citations: 0

Abstract

Background: The field of natural language processing (NLP) has evolved significantly since its inception in the 1950s, with large language models (LLMs) now playing a crucial role in addressing medical challenges.

Objectives: This study evaluates the alignment of three prominent LLMs (Gemini, Copilot, and ChatGPT-4.0) with expert consensus on imaging recommendations for acute flank pain.

Methods: A total of 29 clinical vignettes representing different combinations of age, sex, pregnancy status, likelihood of stone disease, and alternative diagnoses were posed to the three LLMs (Gemini, Copilot, and ChatGPT-4.0) between March and April 2024. Responses were compared to the consensus recommendations of a multispecialty panel. The primary outcome was the rate of LLM responses matching the majority consensus. Secondary outcomes included alignment with consensus-rated perfect (9/9) or excellent (8/9) responses and agreement with any of the nine panel members.
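The outcome computation described in the Methods can be sketched in a few lines: for each vignette, check whether the model's imaging choice matches the panel's majority recommendation (primary outcome) and whether it matches at least one of the nine panel members (a secondary outcome). This is a minimal illustration with hypothetical data; the study's actual vignettes and panel ratings are not reproduced here.

```python
def agreement_rates(llm_answers, panel_answers):
    """Compute (majority-consensus agreement, any-reviewer agreement).

    llm_answers: {vignette_id: imaging_choice}
    panel_answers: {vignette_id: list of the nine panel members' choices}
    """
    n = len(llm_answers)
    majority_match = 0
    any_match = 0
    for vid, ans in llm_answers.items():
        votes = panel_answers[vid]
        # majority recommendation = the most frequent panel choice
        majority = max(set(votes), key=votes.count)
        if ans == majority:
            majority_match += 1
        # secondary outcome: agreement with any single panel member
        if ans in votes:
            any_match += 1
    return majority_match / n, any_match / n

# Toy example: 3 vignettes with hypothetical imaging choices
llm = {1: "CT", 2: "US", 3: "none"}
panel = {1: ["CT"] * 9, 2: ["CT"] * 5 + ["US"] * 4, 3: ["none"] * 8 + ["US"]}
maj, any_m = agreement_rates(llm, panel)
print(round(maj, 3), round(any_m, 3))  # 0.667 1.0
```

With 29 vignettes, a majority-consensus rate of 65.5% corresponds to 19 matching responses, and 41.4% to 12.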

Results: Gemini aligned with the majority consensus in 65.5% of cases, compared to 41.4% for both Copilot and ChatGPT-4.0. In scenarios rated as perfect or excellent by the consensus, Gemini showed 69.5% agreement, significantly higher than Copilot and ChatGPT-4.0, both at 43.4% (p = 0.045 and p < 0.001, respectively). Overall, Gemini demonstrated an agreement rate of 82.7% with any of the nine reviewers, indicating superior capability in addressing complex medical inquiries.
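Comparing proportions like 65.5% versus 41.4% across 29 vignettes is commonly done with a two-proportion z-test; the abstract does not state which test the authors used, so the sketch below is one plausible method with illustrative counts, not the study's actual analysis.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided p-value for H0: the two underlying proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal tail: 2 * (1 - Phi(|z|))
    return math.erfc(abs(z) / math.sqrt(2))

# Illustrative counts only (19/29 ~ 65.5% vs. 12/29 ~ 41.4%)
p = two_proportion_z_test(19, 29, 12, 29)
print(round(p, 3))
```

Note that with only 29 vignettes per model, an absolute difference of roughly 24 percentage points sits near the conventional significance threshold, which is consistent with the borderline p = 0.045 reported for one comparison.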

Conclusion: Gemini consistently outperformed Copilot and ChatGPT-4.0 in aligning with expert consensus, suggesting its potential as a reliable tool in clinical decision-making. Further research is needed to enhance the reliability and accuracy of LLMs and to address the ethical and legal challenges associated with their integration into healthcare systems.


Source journal
CiteScore: 4.60
Self-citation rate: 0.00%
Articles published: 63
Review time: 13 weeks
Journal description: The aim of the journal is to bring to light the various clinical advancements and research developments attained around the world and thus help the specialty forge ahead. It is directed towards physicians and medical personnel undergoing training or working within the field of Emergency Medicine. Medical students who are interested in pursuing a career in Emergency Medicine will also benefit from the journal. This is particularly useful for trainees in countries where the specialty is still in its infancy. Disciplines covered will include interesting clinical cases, the latest evidence-based practice, and research developments in Emergency Medicine, including emergency pediatrics.