Clinical decision support using large language models in otolaryngology: a systematic review.

Impact factor: 2.2
Rania Filali Ansary, Jerome R Lechien
European Archives of Oto-Rhino-Laryngology (official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS); affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery)
DOI: 10.1007/s00405-025-09504-8 · Published 2025-06-06 · Citations: 0

Abstract

Objective: This systematic review evaluated the diagnostic accuracy of large language models (LLMs) in otolaryngology-head and neck surgery clinical decision-making.

Data sources: PubMed/MEDLINE, Cochrane Library, and Embase databases were searched for studies investigating clinical decision support accuracy of LLMs in otolaryngology.

Review methods: Three investigators searched the literature for peer-reviewed studies investigating the application of LLMs as clinical decision support for real clinical cases, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The following outcomes were considered: diagnostic accuracy and the accuracy of additional-examination and treatment recommendations. Study quality was assessed using the modified Methodological Index for Non-Randomized Studies (MINORS).

Results: Of 285 identified publications, 17 met the inclusion criteria, accounting for 734 patients across various otolaryngology subspecialties. ChatGPT-4 was the most frequently evaluated LLM (n = 14/17), followed by Claude-3/3.5 (n = 2/17) and Gemini (n = 2/17). Primary diagnostic accuracy ranged from 45.7% to 80.2% across LLMs, with Claude often outperforming ChatGPT. LLMs demonstrated lower accuracy in recommending appropriate additional examinations (10-29%) and treatments (16.7-60%), with substantial subspecialty variability. Treatment recommendation accuracy was highest in head and neck oncology (55-60%) and lowest in rhinology (16.7%). There was substantial heterogeneity across studies in the inclusion criteria, the information entered into the application programming interface, and the methods of accuracy assessment.
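To make the heterogeneity in accuracy assessment concrete, the core evaluation in these studies can be sketched as: submit each clinical vignette to the model, then score the returned diagnosis against a gold standard. The sketch below is a minimal, hypothetical illustration, not any study's actual protocol; `query_llm` is a placeholder for a real model API call, and exact string matching is only one of several scoring choices the included studies used (others relied on expert adjudication of free-text answers).

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Case:
    """One anonymized clinical vignette with its gold-standard diagnosis."""
    vignette: str
    gold_diagnosis: str

def query_llm(vignette: str) -> str:
    """Hypothetical stand-in for a real API call (e.g. to ChatGPT-4 or
    Claude-3); a study would send the vignette to the model here."""
    raise NotImplementedError

def diagnostic_accuracy(cases: Sequence[Case],
                        predict: Callable[[str], str]) -> float:
    """Fraction of cases where the model's diagnosis matches the gold
    standard (case-insensitive exact match, one scoring choice among many)."""
    correct = sum(
        1 for c in cases
        if predict(c.vignette).strip().lower() == c.gold_diagnosis.strip().lower()
    )
    return correct / len(cases)
```

Because studies differed in what they entered into the API (full history vs. summarized vignette) and in how they scored matches, the same underlying model can yield very different accuracy figures under this scheme, which is one source of the heterogeneity noted above.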

Conclusions: LLMs demonstrate moderate but promising diagnostic accuracy in otolaryngology clinical decision support, performing better at providing diagnoses than at suggesting appropriate additional examinations and treatments. Emerging findings suggest that Claude often outperforms ChatGPT. Methodological standardization is needed for future research.

Level of evidence: NA.
