Soumil Prasad, Jake Langlie, Luke Pasick, Ryan Chen, Elizabeth Franzmann
Evaluating advanced AI reasoning models: ChatGPT-4.0 and DeepSeek-R1 diagnostic performance in otolaryngology: a comparative analysis
American Journal of Otolaryngology, 46 (4), Article 104667. Published 2025-05-06.
DOI: 10.1016/j.amjoto.2025.104667
https://www.sciencedirect.com/science/article/pii/S0196070925000705
Citations: 0
Abstract
Purpose
This study aimed to evaluate the diagnostic accuracy, comprehensiveness, and clinical relevance of two advanced artificial intelligence (AI) models, OpenAI's ChatGPT-4.0 and DeepSeek-R1, in the field of otolaryngology.
Methods
Five common otolaryngology procedures—adenotonsillectomy, tympanoplasty, endoscopic sinus surgery, parotidectomy, and total laryngectomy—were analyzed through standardized queries posed to both AI models. Because the prompts replicate questions that patients typically search online, our evaluation focuses on patient-facing informational adequacy. Responses were independently evaluated by two study members for accuracy, clinical relevance, and comprehensiveness, with discrepancies resolved through consensus. The analysis included comparison with clinical guidelines.
Results
ChatGPT-4.0 generally provided detailed procedural insights, effectively covering indications, methodologies, risks, and recovery processes. However, it occasionally suggested excessive diagnostic imaging and omitted subtle yet significant surgical nuances. DeepSeek-R1 delivered concise, structured responses clearly categorizing indications, treatment alternatives, and procedural risks. Nonetheless, it frequently lacked detailed elaboration, omitting important surgical techniques and minor complications. For instance, DeepSeek-R1 omitted specifics such as hemostatic techniques in adenotonsillectomy and graft stabilization details in tympanoplasty. Neither model adequately addressed critical elements like comprehensive staging, detailed surgical planning, and long-term recovery nuances, especially for complex procedures such as total laryngectomy.
Conclusions
Both ChatGPT-4.0 and DeepSeek-R1 demonstrated significant diagnostic potential but revealed limitations in precision, comprehensiveness, and nuanced clinical reasoning. Their clinical utility remains restricted, highlighting a continued need for AI refinement to enhance patient-specific decision-making capabilities in otolaryngology.
About the journal:
Be fully informed about developments in otology, neurotology, audiology, rhinology, allergy, laryngology, speech science, bronchoesophagology, facial plastic surgery, and head and neck surgery. Featured sections include original contributions, grand rounds, current reviews, case reports and socioeconomics.