{"title":"The Diagnostic Performance of Large Language Models and General Radiologists in Thoracic Radiology Cases: A Comparative Study.","authors":"Yasin Celal Gunes, Turay Cesur","doi":"10.1097/RTI.0000000000000805","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>To investigate and compare the diagnostic performance of 10 different large language models (LLMs) and 2 board-certified general radiologists in thoracic radiology cases published by The Society of Thoracic Radiology.</p><p><strong>Materials and methods: </strong>We collected publicly available 124 \"Case of the Month\" from the Society of Thoracic Radiology website between March 2012 and December 2023. Medical history and imaging findings were input into LLMs for diagnosis and differential diagnosis, while radiologists independently visually provided their assessments. Cases were categorized anatomically (parenchyma, airways, mediastinum-pleura-chest wall, and vascular) and further classified as specific or nonspecific for radiologic diagnosis. Diagnostic accuracy and differential diagnosis scores (DDxScore) were analyzed using the χ2, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests.</p><p><strong>Results: </strong>Among the 124 cases, Claude 3 Opus showed the highest diagnostic accuracy (70.29%), followed by ChatGPT 4/Google Gemini 1.5 Pro (59.75%), Meta Llama 3 70b (57.3%), ChatGPT 3.5 (53.2%), outperforming radiologists (52.4% and 41.1%) and other LLMs (P<0.05). Claude 3 Opus DDxScore was significantly better than other LLMs and radiologists, except ChatGPT 3.5 (P<0.05). All LLMs and radiologists showed greater accuracy in specific cases (P<0.05), with no DDxScore difference for Perplexity and Google Bard based on specificity (P>0.05). There were no significant differences between LLMs and radiologists in the diagnostic accuracy of anatomic subgroups (P>0.05), except for Meta Llama 3 70b in the vascular cases (P=0.040).</p><p><strong>Conclusions: </strong>Claude 3 Opus outperformed other LLMs and radiologists in text-based thoracic radiology cases. LLMs hold great promise for clinical decision systems under proper medical supervision.</p>","PeriodicalId":49974,"journal":{"name":"Journal of Thoracic Imaging","volume":" ","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Thoracic Imaging","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/RTI.0000000000000805","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Abstract
Purpose: To investigate and compare the diagnostic performance of 10 different large language models (LLMs) and 2 board-certified general radiologists in thoracic radiology cases published by the Society of Thoracic Radiology.
Materials and methods: We collected 124 publicly available "Case of the Month" cases from the Society of Thoracic Radiology website, published between March 2012 and December 2023. The medical history and imaging findings of each case were entered into the LLMs as text for diagnosis and differential diagnosis, while the radiologists independently provided their assessments from visual review of the images. Cases were categorized anatomically (parenchyma, airways, mediastinum-pleura-chest wall, and vascular) and further classified as specific or nonspecific for radiologic diagnosis. Diagnostic accuracy and differential diagnosis scores (DDxScore) were analyzed using the χ2, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests.
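The statistical comparisons listed above can be sketched in a few lines of Python. The snippet below is illustrative only, not the authors' code: the case-level data are simulated, and the pairing of tests to comparisons (McNemar for paired diagnostic accuracy on the same cases, Wilcoxon signed-rank for paired DDxScores) is an assumed, plausible reading of the methods; the χ2, Kruskal-Wallis, and Mann-Whitney U tests would cover the group-wise and subgroup comparisons.

```python
# Minimal sketch (not the authors' code) of the paired comparisons
# described in the abstract. All data are simulated; in the study each
# of the 124 cases carries a correct/incorrect flag and a DDxScore per
# reader (LLM or radiologist).
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_cases = 124  # number of "Case of the Month" cases in the study

# Hypothetical binary outcomes (1 = correct top diagnosis) for one LLM
# and one radiologist on the same cases; rates loosely echo the abstract.
llm_correct = rng.binomial(1, 0.70, n_cases)
rad_correct = rng.binomial(1, 0.52, n_cases)

# McNemar test for paired binary outcomes (same cases, two readers).
table = np.zeros((2, 2), dtype=int)
for a, b in zip(llm_correct, rad_correct):
    table[a, b] += 1
print("McNemar p =", mcnemar(table, exact=True).pvalue)

# Wilcoxon signed-rank test on hypothetical ordinal DDxScores (0-2),
# again paired over the same 124 cases.
ddx_llm = rng.integers(0, 3, n_cases)
ddx_rad = rng.integers(0, 3, n_cases)
print("Wilcoxon p =", wilcoxon(ddx_llm, ddx_rad).pvalue)
```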
Results: Among the 124 cases, Claude 3 Opus showed the highest diagnostic accuracy (70.29%), followed by ChatGPT 4 and Google Gemini 1.5 Pro (both 59.75%), Meta Llama 3 70b (57.3%), and ChatGPT 3.5 (53.2%), all of which outperformed the two radiologists (52.4% and 41.1%) and the remaining LLMs (P<0.05). The DDxScore of Claude 3 Opus was significantly better than that of the other LLMs and the radiologists, except ChatGPT 3.5 (P<0.05). All LLMs and both radiologists were more accurate in cases with specific radiologic findings (P<0.05), whereas Perplexity and Google Bard showed no DDxScore difference between specific and nonspecific cases (P>0.05). Diagnostic accuracy did not differ significantly between the LLMs and the radiologists across anatomic subgroups (P>0.05), except for Meta Llama 3 70b in the vascular cases (P=0.040).
Conclusions: Claude 3 Opus outperformed the other LLMs and the radiologists in text-based thoracic radiology cases. LLMs hold great promise for clinical decision support systems under proper medical supervision.
About the journal:
Journal of Thoracic Imaging (JTI) provides authoritative information on all aspects of the use of imaging techniques in the diagnosis of cardiac and pulmonary diseases. Original articles and analytical reviews published in this timely journal provide the very latest thinking of leading experts concerning the use of chest radiography, computed tomography, magnetic resonance imaging, positron emission tomography, ultrasound, and all other promising imaging techniques in cardiopulmonary radiology.
Official journal of the Society of Thoracic Radiology, the Japanese Society of Thoracic Radiology, the Korean Society of Thoracic Radiology, and the European Society of Thoracic Imaging.