Mitchell J Feldman, Edward P Hoffer, Jared J Conley, Jaime Chang, Jeanhee A Chung, Michael C Jernigan, William T Lester, Zachary H Strasser, Henry C Chueh
Dedicated AI Expert System vs Generative AI With Large Language Model for Clinical Diagnoses.
JAMA Network Open, volume 8, issue 5, e2512994. Published 2025-05-01. Publication type: Journal Article.
DOI: https://doi.org/10.1001/jamanetworkopen.2025.12994
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12123466/pdf/
Abstract
Importance: Large language models (LLMs) have not yet been compared with traditional diagnostic decision support systems (DDSSs) on unpublished clinical cases.
Objective: To compare the performance of 2 widely used LLMs (ChatGPT, version 4 [hereafter, LLM1] and Gemini, version 1.5 [hereafter, LLM2]) with a DDSS (DXplain [hereafter, DDSS]) on 36 unpublished general medicine cases.
Design, setting, and participants: This diagnostic study, conducted from October 6, 2023, to November 22, 2024, looked for the presence of the known case diagnosis in the differential diagnoses of the LLMs and DDSS after data from previously unpublished clinical cases from 3 academic medical centers were entered. The systems' performance was assessed both with and without laboratory test data. Each case was reviewed by 3 physicians blinded to the case diagnosis. Physicians identified all clinical findings as well as the subset deemed relevant to making the diagnosis for mapping to the DDSS's controlled vocabulary. Two other physicians, also blinded to the diagnoses, entered the data from these cases into the DDSS, LLM1, and LLM2.
Exposures: All cases were entered into each LLM twice, with and without laboratory test results. For the DDSS, each case was entered 4 times: for all findings and for findings relevant to the diagnosis, each with and without laboratory test results. The top 25 diagnoses in each resulting differential diagnosis were reviewed.
Main outcomes and measures: Presence or absence of the case diagnosis in the system's differential diagnosis and, when present, in which quintile it appeared in the top 25 diagnoses.
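The quintile measure can be made concrete with a minimal sketch. It assumes the top 25 diagnoses are split into five equal bands of five ranks each (ranks 1 to 5 form the first quintile, and so on); the abstract does not state the exact boundaries, and the function name `quintile_of_rank` is hypothetical.

```python
from typing import Optional

TOP_N = 25       # size of the reviewed differential, per the abstract
N_QUINTILES = 5  # assumption: five equal bands of five ranks each


def quintile_of_rank(rank: Optional[int]) -> Optional[int]:
    """Map a 1-based rank in the differential to a quintile (1..5).

    Returns None when the case diagnosis is absent from the top 25.
    """
    if rank is None or rank < 1 or rank > TOP_N:
        return None
    band = TOP_N // N_QUINTILES  # 5 ranks per quintile
    return (rank - 1) // band + 1


if __name__ == "__main__":
    for r in (1, 5, 6, 25, 26, None):
        print(r, quintile_of_rank(r))
```

Under this assumption, a diagnosis listed at rank 6 falls in the second quintile, and one listed at rank 26 or not listed at all counts as absent.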
Results: Among 36 patient cases of various races and ethnicities, genders, and ages (mean [SD] age, 51.4 [16.4] years), in the version with all findings but no laboratory test results, the DDSS listed the case diagnosis in its differential diagnosis more often (56% [20 of 36]) than LLM1 (42% [15 of 36]) and LLM2 (39% [14 of 36]), although this difference did not reach statistical significance (DDSS vs LLM1, P = .09; DDSS vs LLM2, P = .08). All 3 systems listed the case diagnosis in most cases if laboratory test results were included (all findings DDSS, 72% [26 of 36]; LLM1, 64% [23 of 36]; and LLM2, 58% [21 of 36]).
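The reported percentages follow directly from the case counts, as the short sketch below shows. The counts are taken from the Results above; the names `HITS`, `pct`, and `gain_from_labs` are hypothetical, and the sketch does not attempt to reproduce the P values because the abstract does not name the statistical test used.

```python
CASES = 36  # total cases, per the abstract

# Cases in which the known diagnosis appeared in the differential
# (all-findings version for the DDSS), copied from the Results.
HITS = {
    "no_labs": {"DDSS": 20, "LLM1": 15, "LLM2": 14},
    "with_labs": {"DDSS": 26, "LLM1": 23, "LLM2": 21},
}


def pct(count: int, total: int = CASES) -> int:
    """Percentage of cases, rounded to the nearest whole number."""
    return round(100 * count / total)


def gain_from_labs(system: str) -> int:
    """Additional cases identified once laboratory results are included."""
    return HITS["with_labs"][system] - HITS["no_labs"][system]


if __name__ == "__main__":
    for system in ("DDSS", "LLM1", "LLM2"):
        print(
            system,
            f"{pct(HITS['no_labs'][system])}% without labs,",
            f"{pct(HITS['with_labs'][system])}% with labs,",
            f"+{gain_from_labs(system)} cases",
        )
```

By these counts, adding laboratory results yields 6 more identified cases for the DDSS, 8 for LLM1, and 7 for LLM2.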
Conclusions and relevance: In this diagnostic study comparing the performance of a traditional DDSS and current LLMs on unpublished clinical cases, every system listed the case diagnosis in its top 25 diagnoses in most cases when laboratory test results were included. A hybrid approach that combines the parsing and expository linguistic capabilities of LLMs with the deterministic and explanatory capabilities of traditional DDSSs may produce synergistic benefits.
About the journal:
JAMA Network Open, a member of the JAMA Network, is an international, peer-reviewed, open-access general medical journal. The publication is dedicated to disseminating research across various health disciplines and countries, encompassing clinical care, innovation in health care, health policy, and global health.
JAMA Network Open caters to clinicians, investigators, and policymakers, providing a platform for valuable insights and advancements in the medical field. As part of the JAMA Network, a consortium of peer-reviewed general medical and specialty publications, JAMA Network Open contributes to the collective knowledge and understanding within the medical community.