Comparative analysis of large language models in clinical diagnosis: performance evaluation across common and complex medical cases

Mehmed T Dinc, Ali E Bardak, Furkan Bahar, Craig Noronha

JAMIA Open, 8(3): ooaf055. Published 2025-06-12. DOI: 10.1093/jamiaopen/ooaf055. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12161448/pdf/
Abstract
Objectives: This study aimed to systematically evaluate and compare the diagnostic performance of leading large language models (LLMs) in common and complex clinical scenarios, assessing their potential for enhancing clinical reasoning and diagnostic accuracy in authentic clinical decision-making processes.
Materials and methods: Diagnostic capabilities of advanced LLMs (Anthropic's Claude, OpenAI's GPT variants, Google's Gemini) were assessed using 60 common cases and 104 complex, real-world cases from Clinical Problem Solvers' morning rounds. Clinical details were disclosed in stages, mirroring authentic clinical decision-making. Models were evaluated on primary and differential diagnosis accuracy at each stage.
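To make the staged-disclosure protocol concrete, here is a minimal sketch of what such an evaluation loop might look like. The stage names, case layout, and query_model callable are illustrative assumptions, not the authors' published pipeline:

```python
# Illustrative sketch of a staged-disclosure evaluation loop. The stage
# names, case layout, and query_model callable are assumptions for
# illustration, not the authors' actual code.

STAGES = ["triage", "history", "physical_exam", "initial_workup", "full_workup"]

def collect_staged_responses(case: dict, query_model) -> dict:
    """Reveal one case stage by stage and record the model's answer each time.

    `case` maps each stage name to that stage's clinical details;
    `query_model` takes a prompt string and returns the LLM's text reply.
    """
    revealed: list[str] = []
    responses: dict[str, str] = {}
    for stage in STAGES:
        revealed.append(case[stage])  # disclosure is cumulative, as on rounds
        prompt = (
            "Clinical information available so far:\n"
            + "\n".join(revealed)
            + "\n\nGive the single most likely diagnosis, followed by a "
            "ranked differential of up to five alternatives."
        )
        responses[stage] = query_model(prompt)
    return responses
```

Scoring each stage's response for primary and differential accuracy would then happen against the case's gold-standard diagnosis, for example with the LLM-based grader sketched in the Discussion below.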
Results: Advanced LLMs showed high diagnostic accuracy (>90%) in common scenarios, with Claude 3.7 achieving perfect accuracy (100%) for certain conditions. In complex cases, Claude 3.7 achieved the highest accuracy (83.3%) at the final diagnostic stage, significantly outperforming smaller models. Notably, smaller models performed well in common scenarios, matching the performance of larger models.
Discussion: This study evaluated leading LLMs for diagnostic accuracy using staged information disclosure, mirroring real-world practice. Notably, Claude 3.7 Sonnet was the top performer. Employing a novel LLM-based evaluation method for large-scale analysis, the research highlights artificial intelligence's (AI's) potential to enhance diagnostics. It underscores the need for practical frameworks to translate diagnostic accuracy into clinical impact and to integrate AI into medical education.
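One way to picture the LLM-based grading step is an LLM-as-judge comparison between a model's free-text diagnosis and the gold standard. The rubric wording and the `judge` callable below are assumptions for illustration, not the paper's actual evaluation prompt:

```python
# Illustrative LLM-as-judge grader for free-text diagnoses. The rubric
# wording and the `judge` callable are assumptions, not the authors'
# actual evaluation prompt.

def grade_diagnosis(candidate: str, gold: str, judge) -> bool:
    """Return True if a judge LLM deems the candidate diagnosis a match.

    `judge` takes a prompt string and returns the judge model's text reply.
    """
    prompt = (
        "You are grading a diagnostic exercise.\n"
        f"Gold-standard diagnosis: {gold}\n"
        f"Candidate diagnosis: {candidate}\n"
        "Answer YES if both refer to the same disease entity (accept "
        "synonyms and reasonable differences in specificity), else NO."
    )
    return judge(prompt).strip().upper().startswith("YES")
```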
Conclusion: Leading LLMs show remarkable diagnostic accuracy in diverse clinical cases. To fully realize their potential for improving patient care, we must now focus on developing practical implementation frameworks and conducting translational research to integrate these powerful AI tools into medicine.