Accuracy of large language models in generating differential diagnosis from clinical presentation and imaging findings in pediatric cases.

IF 2.3, Zone 3 (Medicine), JCR Q2 PEDIATRICS

Pediatric Radiology Pub Date : 2025-08-01 Epub Date: 2025-07-12 DOI:10.1007/s00247-025-06317-z

Jinho Jung, Michael Phillipi, Bryant Tran, Kasha Chen, Nathan Chan, Erwin Ho, Shawn Sun, Roozbeh Houshyar

Pediatric Radiology, pages 1927-1933. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12394349/pdf/

Citations: 0

Abstract

Background: Large language models (LLMs) have shown promise in assisting medical decision-making. However, few studies have examined the diagnostic accuracy of LLMs in generating differential diagnoses from text-based image descriptions and clinical presentations in pediatric radiology.

Objective: To examine the performance of multiple proprietary LLMs in producing accurate differential diagnoses for text-based pediatric radiological cases without imaging.

Materials and methods: One hundred sixty-four cases were retrospectively selected from a pediatric radiology textbook and converted into two formats: (1) image description only, and (2) image description with clinical presentation. ChatGPT-4 V, Claude 3.5 Sonnet, and Gemini 1.5 Pro were given these inputs and tasked with providing a top 1 diagnosis and a top 3 differential diagnosis. Accuracy of responses was assessed by comparison with the original literature. Top 1 accuracy was defined as whether the top 1 diagnosis matched the textbook, and top 3 differential accuracy was defined as the number of diagnoses in the model-generated top 3 differential that matched any of the top 3 diagnoses in the textbook. McNemar's test, Cochran's Q test, the Friedman test, and the Wilcoxon signed-rank test were used to compare algorithms and to assess the impact of added clinical information.
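The two accuracy definitions above, and a paired comparison of the kind McNemar's test performs, can be sketched in Python. This is only an illustration under stated assumptions: the abstract does not say how a "match" was judged, so the normalized string comparison here is a stand-in, and the function names and the example diagnoses are hypothetical rather than taken from the study.

```python
from math import comb


def normalize(dx: str) -> str:
    """Lowercase and collapse whitespace (assumed matching rule, not the study's)."""
    return " ".join(dx.lower().split())


def top1_correct(model_top1: str, textbook_top1: str) -> bool:
    """Top 1 accuracy for one case: does the model's first diagnosis match the textbook's?"""
    return normalize(model_top1) == normalize(textbook_top1)


def top3_score(model_top3: list[str], textbook_top3: list[str]) -> int:
    """Top 3 differential accuracy for one case: number of matching diagnoses (0 to 3)."""
    reference = {normalize(dx) for dx in textbook_top3}
    return len({normalize(dx) for dx in model_top3} & reference)


def mcnemar_exact(before: list[bool], after: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes.

    Only discordant pairs (correct in one condition, wrong in the other) carry
    information; under the null hypothesis they split 50/50.
    """
    gained = sum(1 for b, a in zip(before, after) if not b and a)
    lost = sum(1 for b, a in zip(before, after) if b and not a)
    n = gained + lost
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(gained, lost) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical case, for illustration only (not study data).
score = top3_score(
    ["Wilms tumor", "Neuroblastoma", "Lymphoma"],
    ["neuroblastoma", "wilms  tumor", "mesoblastic nephroma"],
)
print(score)
```

In this framing, `before` and `after` would hold one model's per-case top 1 results without and with the clinical presentation, so the test asks whether the added information changed accuracy on the same 164 cases.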

Results: There was no significant difference in top 1 accuracy between ChatGPT-4 V, Claude 3.5 Sonnet, and Gemini 1.5 Pro when only image descriptions were provided (56.1% [95% CI 48.4-63.5], 64.6% [95% CI 57.1-71.5], 61.6% [95% CI 54.0-68.7]; P = 0.11). Adding clinical presentation to image description significantly improved top 1 accuracy for ChatGPT-4 V (64.0% [95% CI 56.4-71.0], P = 0.02) and Claude 3.5 Sonnet (80.5% [95% CI 73.8-85.8], P < 0.001). For image description and clinical presentation cases, Claude 3.5 Sonnet significantly outperformed both ChatGPT-4 V and Gemini 1.5 Pro (P < 0.001). For top 3 differential accuracy, no significant differences were observed between ChatGPT-4 V, Claude 3.5 Sonnet, and Gemini 1.5 Pro, regardless of whether the cases included only image descriptions (1.29 [95% CI 1.16-1.41], 1.35 [95% CI 1.23-1.48], 1.37 [95% CI 1.25-1.49]; P = 0.60) or both image descriptions and clinical presentations (1.33 [95% CI 1.20-1.45], 1.52 [95% CI 1.41-1.64], 1.48 [95% CI 1.36-1.59]; P = 0.72). Only Claude 3.5 Sonnet performed significantly better when clinical presentation was added (P < 0.001).
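As a reading aid for the intervals above, the sketch below computes a 95% confidence interval for a proportion. Two assumptions should be flagged: the abstract does not name the interval method (a Wilson score interval is used here because it appears consistent with the reported values), and the count of 92 correct cases is back-calculated from 56.1% of 164 rather than stated in the source.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed count: 92 of 164 cases (back-calculated from the reported 56.1%).
lo, hi = wilson_interval(92, 164)
print(f"{92 / 164:.1%} [95% CI {lo * 100:.1f}-{hi * 100:.1f}]")
```

With these assumed inputs the interval comes out at roughly 48.4 to 63.5, in line with the value reported for ChatGPT-4 V on image descriptions alone.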

Conclusion: Commercial LLMs performed similarly on pediatric radiology cases in top 1 accuracy and top 3 differential accuracy when only a text-based image description was used. Adding clinical presentation significantly improved top 1 accuracy for ChatGPT-4 V and Claude 3.5 Sonnet, with Claude showing the largest improvement. Claude 3.5 Sonnet outperformed both ChatGPT-4 V and Gemini 1.5 Pro in top 1 accuracy when both image descriptions and clinical presentations were provided. No significant differences were found in top 3 differential accuracy across models in any condition.


Source journal

Pediatric Radiology (Medicine: Nuclear Medicine)

CiteScore: 4.40

Self-citation rate: 17.40%

Articles published: 300

Review time: 3-6 weeks

About the journal: Official Journal of the European Society of Pediatric Radiology, the Society for Pediatric Radiology and the Asian and Oceanic Society for Pediatric Radiology. Pediatric Radiology informs its readers of new findings and progress in all areas of pediatric imaging and in related fields. This is achieved by a blend of original papers, complemented by reviews that set out the present state of knowledge in a particular area of the specialty or summarize specific topics in which discussion has led to clear conclusions. Advances in technology, methodology, apparatus and auxiliary equipment are presented, and modifications of standard techniques are described. Manuscripts submitted for publication must contain a statement to the effect that all human studies have been reviewed by the appropriate ethics committee and have therefore been performed in accordance with the ethical standards laid down in an appropriate version of the 1964 Declaration of Helsinki. It should also be stated clearly in the text that all persons gave their informed consent prior to their inclusion in the study. Details that might disclose the identity of the subjects under study should be omitted.