Performance of generative AI across ENT tasks: A systematic review and meta-analysis

Sholem Hack, Rebecca Attal, Armin Farzad, Eran E. Alon, Eran Glikson, Eric Remer, Alberto Maria Saibene, Habib G. Zalzal

Auris Nasus Larynx, Volume 52, Issue 5, Pages 585–596. Published 2025-09-04. DOI: 10.1016/j.anl.2025.08.010
Citations: 0
Abstract
Objective
To systematically evaluate the diagnostic accuracy, educational utility, and communication potential of generative AI, particularly Large Language Models (LLMs) such as ChatGPT, in otolaryngology.
Data Sources
A comprehensive search of PubMed, Embase, Scopus, Web of Science, and IEEE Xplore identified English-language peer-reviewed studies from January 2022 to March 2025.
Review Methods
Eligible studies evaluated text-based generative AI models used in otolaryngology. Two reviewers screened and assessed studies using JBI and QUADAS-2 tools. A random-effects meta-analysis was conducted on diagnostic accuracy outcomes, with subgroup analyses by task type and model version.
Results
Ninety-one studies were included; 61 reported quantitative outcomes. Of these, 43 provided diagnostic accuracy data across 59 model-task pairs. Pooled diagnostic accuracy was 72.7 % (95 % CI: 67.4–77.6 %; I² = 93.8 %). Accuracy was highest in education (83.0 %) and diagnostic imaging tasks (84.9 %), and lowest in clinical decision support (67.1 %). GPT-4 consistently outperformed GPT-3.5 across both education and CDS domains. Hallucinations and performance variability were noted in complex clinical reasoning tasks.
Conclusion
Generative AI performs well in structured otolaryngology tasks, particularly education and communication. However, its inconsistent performance in clinical reasoning tasks limits standalone use. Future research should focus on hallucination mitigation, standardized evaluation, and prospective validation to guide safe clinical integration.
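The review's Methods describe a random-effects meta-analysis of diagnostic-accuracy proportions, and the Results report a pooled estimate with a 95 % CI and an I² heterogeneity statistic. As a rough illustration of how such numbers are produced, here is a minimal DerSimonian–Laird pooling sketch. The study counts below are invented for demonstration only; they are not data from the review, and the actual analysis may differ (e.g., in transformation or estimator choice).

```python
import math

# Hypothetical per-study counts: (correct responses, total items).
# These values are made up for illustration.
studies = [(45, 60), (70, 100), (30, 50), (88, 110)]

# Logit-transform each proportion; the within-study variance of the
# logit of a proportion is approximately 1/successes + 1/failures.
effects, variances = [], []
for k, n in studies:
    p = k / n
    effects.append(math.log(p / (1 - p)))
    variances.append(1 / k + 1 / (n - k))

# Fixed-effect weights, pooled estimate, and Cochran's Q.
w = [1 / v for v in variances]
theta_fe = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
q = sum(wi * (yi - theta_fe) ** 2 for wi, yi in zip(w, effects))

# DerSimonian-Laird between-study variance tau^2 and the I^2 statistic.
df = len(studies) - 1
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)
i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

# Random-effects pooled estimate and 95% CI, back-transformed
# from the logit scale to a proportion.
w_re = [1 / (v + tau2) for v in variances]
theta_re = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
se_re = math.sqrt(1 / sum(w_re))
pooled = 1 / (1 + math.exp(-theta_re))
lo = 1 / (1 + math.exp(-(theta_re - 1.96 * se_re)))
hi = 1 / (1 + math.exp(-(theta_re + 1.96 * se_re)))
print(f"pooled accuracy = {pooled:.1%} (95% CI {lo:.1%}-{hi:.1%}), I2 = {i2:.1f}%")
```

With high observed heterogeneity such as the I² = 93.8 % reported in this review, the random-effects weights flatten toward equality across studies and the confidence interval widens, which is why subgroup analyses by task type and model version are informative.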
About the journal:
The international journal Auris Nasus Larynx provides the opportunity for rapid, carefully reviewed publication of work concerning the fundamental and clinical aspects of otorhinolaryngology and related fields. These include otology, neurotology, bronchoesophagology, laryngology, rhinology, allergology, head and neck medicine and oncologic surgery, maxillofacial and plastic surgery, audiology, and speech science.
Original papers, short communications and original case reports can be submitted. Reviews on recent developments are invited regularly and Letters to the Editor commenting on papers or any aspect of Auris Nasus Larynx are welcomed.
Founded in 1973 and previously published by the Society for Promotion of International Otorhinolaryngology, the journal is now the official English-language journal of the Oto-Rhino-Laryngological Society of Japan, Inc. The aim of its new international Editorial Board is to make Auris Nasus Larynx an international forum for high quality research and clinical sciences.