Comparative analysis of large language models in clinical diagnosis: performance evaluation across common and complex medical cases

Mehmed T Dinc, Ali E Bardak, Furkan Bahar, Craig Noronha

JAMIA Open, 8(3): ooaf055. Published 2025-06-12. DOI: 10.1093/jamiaopen/ooaf055. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12161448/pdf/
Abstract
Objectives: This study aimed to systematically evaluate and compare the diagnostic performance of leading large language models (LLMs) in common and complex clinical scenarios, assessing their potential for enhancing clinical reasoning and diagnostic accuracy in authentic clinical decision-making processes.
Materials and methods: Diagnostic capabilities of advanced LLMs (Anthropic's Claude, OpenAI's GPT variants, Google's Gemini) were assessed using 60 common cases and 104 complex, real-world cases from Clinical Problem Solvers' morning rounds. Clinical details were disclosed in stages, mirroring authentic clinical decision-making. Models were evaluated on primary and differential diagnosis accuracy at each stage.
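To make the staged-disclosure protocol concrete, here is a minimal sketch of what such an evaluation loop might look like. The stage names, case layout, and query_model callable are illustrative assumptions, not the authors' published pipeline:

```python
# Illustrative sketch of a staged-disclosure evaluation loop. The stage
# names, case layout, and query_model callable are assumptions for
# illustration, not the authors' actual code.

STAGES = ["triage", "history", "physical_exam", "initial_workup", "full_workup"]

def collect_staged_responses(case: dict, query_model) -> dict:
    """Reveal one case stage by stage and record the model's answer each time.

    `case` maps each stage name to that stage's clinical details;
    `query_model` takes a prompt string and returns the LLM's text reply.
    """
    revealed: list[str] = []
    responses: dict[str, str] = {}
    for stage in STAGES:
        revealed.append(case[stage])  # disclosure is cumulative, as on rounds
        prompt = (
            "Clinical information available so far:\n"
            + "\n".join(revealed)
            + "\n\nGive the single most likely diagnosis, followed by a "
            "ranked differential of up to five alternatives."
        )
        responses[stage] = query_model(prompt)
    return responses
```

Scoring each stage's response for primary and differential accuracy would then happen against the case's gold-standard diagnosis, for example with the LLM-based grader sketched in the Discussion below.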
Results: Advanced LLMs showed high diagnostic accuracy (>90%) in common scenarios, with Claude 3.7 achieving perfect accuracy (100%) for certain conditions. In complex cases, Claude 3.7 achieved the highest accuracy (83.3%) at the final diagnostic stage, significantly outperforming smaller models. Notably, smaller models performed well in common scenarios, matching the performance of larger models.
Discussion: This study evaluated leading LLMs for diagnostic accuracy using staged information disclosure, mirroring real-world practice. Notably, Claude 3.7 Sonnet was the top performer. Employing a novel LLM-based evaluation method for large-scale analysis, the research highlights artificial intelligence's (AI's) potential to enhance diagnostics. It underscores the need for practical frameworks to translate diagnostic accuracy into clinical impact and to integrate AI into medical education.
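One way to picture the LLM-based grading step is an LLM-as-judge comparison between a model's free-text diagnosis and the gold standard. The rubric wording and the `judge` callable below are assumptions for illustration, not the paper's actual evaluation prompt:

```python
# Illustrative LLM-as-judge grader for free-text diagnoses. The rubric
# wording and the `judge` callable are assumptions, not the authors'
# actual evaluation prompt.

def grade_diagnosis(candidate: str, gold: str, judge) -> bool:
    """Return True if a judge LLM deems the candidate diagnosis a match.

    `judge` takes a prompt string and returns the judge model's text reply.
    """
    prompt = (
        "You are grading a diagnostic exercise.\n"
        f"Gold-standard diagnosis: {gold}\n"
        f"Candidate diagnosis: {candidate}\n"
        "Answer YES if both refer to the same disease entity (accept "
        "synonyms and reasonable differences in specificity), else NO."
    )
    return judge(prompt).strip().upper().startswith("YES")
```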
Conclusion: Leading LLMs show remarkable diagnostic accuracy in diverse clinical cases. To fully realize their potential for improving patient care, we must now focus on developing practical implementation frameworks and conducting translational research to integrate these powerful AI tools into medicine.