Performance Review of Meta LLaMa 3.1 in Thoracic Imaging and Diagnostics
Golnaz Lotfian, Keyur Parekh, Pokhraj P. Suthar
iRadiology, 3(4), 279-288. Published 2025-05-11. DOI: 10.1002/ird3.70013
https://onlinelibrary.wiley.com/doi/10.1002/ird3.70013
Citations: 0
Abstract
Background
The integration of artificial intelligence (AI) in radiology has opened new possibilities for diagnostic accuracy, with large language models (LLMs) showing potential for supporting clinical decision-making. While proprietary models like ChatGPT have gained attention, open-source alternatives such as Meta LLaMa 3.1 remain underexplored. This study aims to evaluate the diagnostic accuracy of LLaMa 3.1 in thoracic imaging and to discuss broader implications of open-source versus proprietary AI models in healthcare.
Methods
Meta LLaMa 3.1 (8B parameter version) was tested on 126 multiple-choice thoracic imaging questions selected from Thoracic Imaging: A Core Review by Hobbs et al. These questions required no image interpretation. The model’s answers were validated by two board-certified diagnostic radiologists. Accuracy was assessed overall and across subgroups, including intensive care, pathology, and anatomy. Additionally, a narrative review introduces three widely used AI platforms in thoracic imaging: DeepLesion, ChexNet, and 3D Slicer.
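The evaluation protocol described above — posing each multiple-choice question to the model and tallying accuracy overall and per subgroup — can be sketched as a simple scoring loop. This is an illustrative reconstruction, not the authors' actual pipeline: the item fields and the `always_a` stand-in for querying a local LLaMa 3.1 instance are assumptions for the sake of a self-contained example.

```python
from collections import defaultdict

def score_answers(items, model_answer):
    """Compute overall and per-subgroup accuracy for multiple-choice items.

    items: list of dicts with 'question', 'choices', 'correct', 'subgroup'.
    model_answer: callable (question, choices) -> chosen option letter.
    """
    correct_by_group = defaultdict(int)
    total_by_group = defaultdict(int)
    for item in items:
        total_by_group[item["subgroup"]] += 1
        if model_answer(item["question"], item["choices"]) == item["correct"]:
            correct_by_group[item["subgroup"]] += 1
    per_group = {g: correct_by_group[g] / total_by_group[g] for g in total_by_group}
    overall = sum(correct_by_group.values()) / sum(total_by_group.values())
    return overall, per_group

# Hypothetical stand-in for the model query (in practice this would prompt
# a local LLaMa 3.1 8B instance); it always answers "A" so the example runs
# without any model dependency.
def always_a(question, choices):
    return "A"

items = [
    {"question": "Q1", "choices": ["A", "B"], "correct": "A", "subgroup": "intensive care"},
    {"question": "Q2", "choices": ["A", "B"], "correct": "B", "subgroup": "anatomy"},
]
overall, per_group = score_answers(items, always_a)
print(round(overall, 3), per_group)  # 0.5 {'intensive care': 1.0, 'anatomy': 0.0}
```

In the study's setting, the 126 validated items would replace the toy list, and the per-subgroup dictionary yields figures like the intensive-care and anatomy accuracies reported in the Results.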
Results
LLaMa 3.1 achieved an overall accuracy of 61.1%. It performed well in intensive care (90.0%) and terms and signs (83.3%) but showed variability across subgroups, with lower accuracy in normal anatomy and basic imaging (40.0%). Subgroup analysis revealed strengths in infectious pneumonia and pleural disease, but notable weaknesses in lung cancer and vascular pathology.
Conclusion
LLaMa 3.1 demonstrates promise as an open-source NLP tool in thoracic diagnostics, though its performance variability highlights the need for refinement and domain-specific training. Open-source models offer transparency and accessibility, while proprietary models deliver consistency. Both hold value, depending on clinical context and resource availability.