Performance Review of Meta LLaMa 3.1 in Thoracic Imaging and Diagnostics

iRadiology · Publication date: 2025-05-11 · DOI: 10.1002/ird3.70013
Golnaz Lotfian, Keyur Parekh, Pokhraj P. Suthar
{"title":"Performance Review of Meta LLaMa 3.1 in Thoracic Imaging and Diagnostics","authors":"Golnaz Lotfian,&nbsp;Keyur Parekh,&nbsp;Pokhraj P. Suthar","doi":"10.1002/ird3.70013","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Background</h3>\n \n <p>The integration of artificial intelligence (AI) in radiology has opened new possibilities for diagnostic accuracy, with large language models (LLMs) showing potential for supporting clinical decision-making. While proprietary models like ChatGPT have gained attention, open-source alternatives such as Meta LLaMa 3.1 remain underexplored. This study aims to evaluate the diagnostic accuracy of LLaMa 3.1 in thoracic imaging and to discuss broader implications of open-source versus proprietary AI models in healthcare.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>Meta LLaMa 3.1 (8B parameter version) was tested on 126 multiple-choice thoracic imaging questions selected from <i>Thoracic Imaging: A Core Review</i> by Hobbs et al. These questions required no image interpretation. The model’s answers were validated by two board-certified diagnostic radiologists. Accuracy was assessed overall and across subgroups, including intensive care, pathology, and anatomy. Additionally, a narrative review introduces three widely used AI platforms in thoracic imaging: DeepLesion, ChexNet, and 3D Slicer.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>LLaMa 3.1 achieved an overall accuracy of 61.1%. It performed well in intensive care (90.0%) and terms and signs (83.3%) but showed variability across subgroups, with lower accuracy in normal anatomy and basic imaging (40.0%). Subgroup analysis revealed strengths in infectious pneumonia and pleural disease, but notable weaknesses in lung cancer and vascular pathology.</p>\n </section>\n \n <section>\n \n <h3> Conclusion</h3>\n \n <p>LLaMa 3.1 demonstrates promise as an open-source NLP tool in thoracic diagnostics, though its performance variability highlights the need for refinement and domain-specific training. Open-source models offer transparency and accessibility, while proprietary models deliver consistency. Both hold value, depending on clinical context and resource availability.</p>\n </section>\n </div>","PeriodicalId":73508,"journal":{"name":"iRadiology","volume":"3 4","pages":"279-288"},"PeriodicalIF":0.0000,"publicationDate":"2025-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ird3.70013","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"iRadiology","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/ird3.70013","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Background

The integration of artificial intelligence (AI) into radiology has opened new possibilities for improving diagnostic accuracy, with large language models (LLMs) showing potential to support clinical decision-making. While proprietary models such as ChatGPT have attracted considerable attention, open-source alternatives such as Meta LLaMa 3.1 remain underexplored. This study evaluates the diagnostic accuracy of LLaMa 3.1 on thoracic imaging questions and discusses the broader implications of open-source versus proprietary AI models in healthcare.

Methods

Meta LLaMa 3.1 (8B-parameter version) was tested on 126 multiple-choice thoracic imaging questions selected from Thoracic Imaging: A Core Review by Hobbs et al.; none of the questions required image interpretation. The model's answers were validated by two board-certified diagnostic radiologists. Accuracy was assessed overall and across subgroups, including intensive care, pathology, and anatomy. Additionally, a narrative review introduces three AI platforms widely used in thoracic imaging: DeepLesion, CheXNet, and 3D Slicer.
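
The abstract does not specify how the questions were administered, but the workflow it describes (posing text-only multiple-choice questions to the 8B model and tallying accuracy overall and per subgroup) can be sketched roughly as follows. This is a minimal illustration, assuming local access to LLaMa 3.1 8B through the Ollama Python client; the record format, the ask_llama and extract_choice helpers, and the sample question are hypothetical and not taken from the study.

```python
# Minimal sketch of a multiple-choice evaluation loop for LLaMa 3.1 8B:
# ask each text-only question, parse the chosen letter, and score accuracy
# overall and per subgroup. Model access via Ollama is an assumption.
import re
from collections import defaultdict

import ollama  # pip install ollama; assumes a local Ollama server with llama3.1:8b pulled


def ask_llama(question, options):
    """Send one text-only multiple-choice question and return the model's raw reply."""
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {text}" for letter, text in options.items())
        + "\nAnswer with the single letter of the best option."
    )
    reply = ollama.chat(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply["message"]["content"]


def extract_choice(reply):
    """Pull the first standalone option letter (A-D) out of the model's reply."""
    match = re.search(r"\b([A-D])\b", reply.upper())
    return match.group(1) if match else None


# Hypothetical record format: question text, options, keyed answer, and a subgroup label.
questions = [
    {
        "subgroup": "terms and signs",
        "question": "Which sign refers to a crescent of air outlining the aortic arch "
                    "in left upper lobe collapse?",
        "options": {"A": "Luftsichel sign", "B": "Deep sulcus sign",
                    "C": "Golden S sign", "D": "Halo sign"},
        "answer": "A",
    },
    # ... remaining questions
]

correct_by_group = defaultdict(int)
total_by_group = defaultdict(int)

for q in questions:
    choice = extract_choice(ask_llama(q["question"], q["options"]))
    total_by_group[q["subgroup"]] += 1
    if choice == q["answer"]:
        correct_by_group[q["subgroup"]] += 1

overall_correct = sum(correct_by_group.values())
overall_total = sum(total_by_group.values())
print(f"Overall accuracy: {overall_correct / overall_total:.1%}")
for group, total in total_by_group.items():
    acc = correct_by_group[group] / total
    print(f"{group}: {acc:.1%} ({correct_by_group[group]}/{total})")
```

In the study itself, the model's answers were validated manually by two radiologists rather than scored automatically, so a scripted scorer like this would at most pre-screen the replies before expert review.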

Results

LLaMa 3.1 achieved an overall accuracy of 61.1%. It performed well on intensive care (90.0%) and on terms and signs (83.3%), but showed variability across subgroups, with lower accuracy on normal anatomy and basic imaging (40.0%). Subgroup analysis revealed strengths in infectious pneumonia and pleural disease, but notable weaknesses in lung cancer and vascular pathology.
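
For scale (a back-calculation from the reported figures, not a count stated in the abstract), an overall accuracy of 61.1% on 126 questions corresponds to roughly 77 correct answers, since 77/126 ≈ 0.611. Subgroup sizes are not reported in the abstract, so the subgroup percentages cannot be converted to raw counts in the same way.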

Conclusion

LLaMa 3.1 demonstrates promise as an open-source NLP tool in thoracic diagnostics, though its performance variability highlights the need for refinement and domain-specific training. Open-source models offer transparency and accessibility, while proprietary models deliver consistency. Both hold value, depending on clinical context and resource availability.
