{"title":"Meta LLaMa 3.1在胸部成像和诊断中的性能评价","authors":"Golnaz Lotfian, Keyur Parekh, Pokhraj P. Suthar","doi":"10.1002/ird3.70013","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Background</h3>\n \n <p>The integration of artificial intelligence (AI) in radiology has opened new possibilities for diagnostic accuracy, with large language models (LLMs) showing potential for supporting clinical decision-making. While proprietary models like ChatGPT have gained attention, open-source alternatives such as Meta LLaMa 3.1 remain underexplored. This study aims to evaluate the diagnostic accuracy of LLaMa 3.1 in thoracic imaging and to discuss broader implications of open-source versus proprietary AI models in healthcare.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>Meta LLaMa 3.1 (8B parameter version) was tested on 126 multiple-choice thoracic imaging questions selected from <i>Thoracic Imaging: A Core Review</i> by Hobbs et al. These questions required no image interpretation. The model’s answers were validated by two board-certified diagnostic radiologists. Accuracy was assessed overall and across subgroups, including intensive care, pathology, and anatomy. Additionally, a narrative review introduces three widely used AI platforms in thoracic imaging: DeepLesion, ChexNet, and 3D Slicer.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>LLaMa 3.1 achieved an overall accuracy of 61.1%. It performed well in intensive care (90.0%) and terms and signs (83.3%) but showed variability across subgroups, with lower accuracy in normal anatomy and basic imaging (40.0%). Subgroup analysis revealed strengths in infectious pneumonia and pleural disease, but notable weaknesses in lung cancer and vascular pathology.</p>\n </section>\n \n <section>\n \n <h3> Conclusion</h3>\n \n <p>LLaMa 3.1 demonstrates promise as an open-source NLP tool in thoracic diagnostics, though its performance variability highlights the need for refinement and domain-specific training. Open-source models offer transparency and accessibility, while proprietary models deliver consistency. Both hold value, depending on clinical context and resource availability.</p>\n </section>\n </div>","PeriodicalId":73508,"journal":{"name":"iRadiology","volume":"3 4","pages":"279-288"},"PeriodicalIF":0.0000,"publicationDate":"2025-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ird3.70013","citationCount":"0","resultStr":"{\"title\":\"Performance Review of Meta LLaMa 3.1 in Thoracic Imaging and Diagnostics\",\"authors\":\"Golnaz Lotfian, Keyur Parekh, Pokhraj P. Suthar\",\"doi\":\"10.1002/ird3.70013\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n \\n <section>\\n \\n <h3> Background</h3>\\n \\n <p>The integration of artificial intelligence (AI) in radiology has opened new possibilities for diagnostic accuracy, with large language models (LLMs) showing potential for supporting clinical decision-making. While proprietary models like ChatGPT have gained attention, open-source alternatives such as Meta LLaMa 3.1 remain underexplored. 
This study aims to evaluate the diagnostic accuracy of LLaMa 3.1 in thoracic imaging and to discuss broader implications of open-source versus proprietary AI models in healthcare.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Methods</h3>\\n \\n <p>Meta LLaMa 3.1 (8B parameter version) was tested on 126 multiple-choice thoracic imaging questions selected from <i>Thoracic Imaging: A Core Review</i> by Hobbs et al. These questions required no image interpretation. The model’s answers were validated by two board-certified diagnostic radiologists. Accuracy was assessed overall and across subgroups, including intensive care, pathology, and anatomy. Additionally, a narrative review introduces three widely used AI platforms in thoracic imaging: DeepLesion, ChexNet, and 3D Slicer.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Results</h3>\\n \\n <p>LLaMa 3.1 achieved an overall accuracy of 61.1%. It performed well in intensive care (90.0%) and terms and signs (83.3%) but showed variability across subgroups, with lower accuracy in normal anatomy and basic imaging (40.0%). Subgroup analysis revealed strengths in infectious pneumonia and pleural disease, but notable weaknesses in lung cancer and vascular pathology.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Conclusion</h3>\\n \\n <p>LLaMa 3.1 demonstrates promise as an open-source NLP tool in thoracic diagnostics, though its performance variability highlights the need for refinement and domain-specific training. Open-source models offer transparency and accessibility, while proprietary models deliver consistency. Both hold value, depending on clinical context and resource availability.</p>\\n </section>\\n </div>\",\"PeriodicalId\":73508,\"journal\":{\"name\":\"iRadiology\",\"volume\":\"3 4\",\"pages\":\"279-288\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-05-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ird3.70013\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"iRadiology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/ird3.70013\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"iRadiology","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/ird3.70013","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Performance Review of Meta LLaMa 3.1 in Thoracic Imaging and Diagnostics
Background
The integration of artificial intelligence (AI) into radiology has opened new possibilities for improving diagnostic accuracy, with large language models (LLMs) showing potential to support clinical decision-making. While proprietary models such as ChatGPT have attracted considerable attention, open-source alternatives such as Meta LLaMa 3.1 remain underexplored. This study evaluates the diagnostic accuracy of LLaMa 3.1 on thoracic imaging questions and discusses the broader implications of open-source versus proprietary AI models in healthcare.
Methods
Meta LLaMa 3.1 (8B-parameter version) was tested on 126 multiple-choice thoracic imaging questions selected from Thoracic Imaging: A Core Review by Hobbs et al.; none of the questions required image interpretation. The model's answers were validated by two board-certified diagnostic radiologists. Accuracy was assessed overall and across subgroups, including intensive care, pathology, and anatomy. In addition, a narrative review introduces three AI platforms widely used in thoracic imaging: DeepLesion, CheXNet, and 3D Slicer.
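The abstract does not include the evaluation code; the following minimal Python sketch is purely illustrative, assuming each graded question is stored as a (subgroup, model answer, answer key) tuple, and shows how overall and per-subgroup accuracy figures of the kind reported below could be computed. All record values shown are hypothetical.

```python
from collections import defaultdict

# Hypothetical graded records: (subgroup, model_answer, answer_key) for each
# multiple-choice question. The study's actual data are not published here.
records = [
    ("intensive care", "B", "B"),
    ("normal anatomy and basic imaging", "C", "A"),
    ("terms and signs", "D", "D"),
    # ... remaining graded questions (126 in total) ...
]

correct_total = 0
per_group = defaultdict(lambda: [0, 0])  # subgroup -> [correct, answered]

for subgroup, model_answer, key_answer in records:
    hit = model_answer == key_answer      # True counts as 1 when summed
    correct_total += hit
    per_group[subgroup][0] += hit
    per_group[subgroup][1] += 1

print(f"Overall accuracy: {correct_total / len(records):.1%}")
for subgroup, (hits, total) in sorted(per_group.items()):
    print(f"{subgroup}: {hits / total:.1%} ({hits}/{total})")
```

Run over a full set of 126 graded questions, a loop of this kind would produce overall and subgroup percentages in the form reported in the Results.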
Results
LLaMa 3.1 achieved an overall accuracy of 61.1%. It performed well on intensive care (90.0%) and terms-and-signs (83.3%) questions but showed variability across subgroups, with lower accuracy on normal anatomy and basic imaging (40.0%). Subgroup analysis revealed strengths in infectious pneumonia and pleural disease but notable weaknesses in lung cancer and vascular pathology.
Conclusion
LLaMa 3.1 demonstrates promise as an open-source NLP tool in thoracic diagnostics, though its performance variability highlights the need for refinement and domain-specific training. Open-source models offer transparency and accessibility, while proprietary models deliver consistency. Both hold value, depending on clinical context and resource availability.