Leveraging large language models for accurate classification of liver lesions from MRI reports.

IF 4.4; Zone 2, Biology; JCR Q2, Biochemistry & Molecular Biology

Computational and Structural Biotechnology Journal; Pub Date: 2025-05-21; eCollection Date: 2025-01-01; DOI: 10.1016/j.csbj.2025.05.019

Daniel Spitzl, Markus Mergen, Ulrike Bauer, Friederike Jungmann, Keno K Bressem, Felix Busch, Marcus R Makowski, Lisa C Adams, Florian T Gassert

Computational and Structural Biotechnology Journal, vol. 27, pp. 2139-2146. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12158552/pdf/

Citations: 0

Abstract


Leveraging large language models for accurate classification of liver lesions from MRI reports.

Background & aims: The rapid advancement of large language models (LLMs) has generated interest in their potential integration in clinical workflows. However, their effectiveness in interpreting complex (imaging) reports remains underexplored and has at times yielded suboptimal results. This study aims to assess the capability of state-of-the-art LLMs to classify liver lesions based solely on textual descriptions from MRI reports, challenging the models to interpret nuanced medical language and diagnostic criteria.

Methods: We evaluated multiple LLMs, including GPT-4o, DeepSeek V3, Claude 3.5 Sonnet, and Gemini 2.0 Flash, on a physician-generated fictitious dataset of 88 MRI reports designed to resemble real clinical radiology documentation. The dataset included a representative spectrum of common liver lesions, such as hepatocellular carcinoma, cholangiocarcinoma, hemangiomas, metastases, and focal nodular hyperplasia. Model performance was assessed using micro and macro F1-scores benchmarked against ground truth labels.
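As an illustration of the two metrics named in the Methods, the following is a minimal sketch of how micro and macro F1-scores can be computed for a single-label, multi-class task. It is not the authors' evaluation code: the function name `f1_scores` and the toy labels are hypothetical and do not come from the study's dataset. Micro F1 pools true positives, false positives, and false negatives across all classes before computing F1, whereas macro F1 averages the per-class F1 values so that rare lesion types count as much as common ones.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy labels for illustration only (not data from the study)
y_true = ["HCC", "HCC", "hemangioma", "metastasis", "FNH", "CCA"]
y_pred = ["HCC", "CCA", "hemangioma", "metastasis", "FNH", "CCA"]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In a single-label setting such as this one, every misclassification adds exactly one false positive and one false negative, so micro F1 equals plain accuracy; macro F1 differs from it whenever errors are concentrated in particular classes.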

Results: Claude 3.5 Sonnet demonstrated the highest diagnostic accuracy among the evaluated models, achieving a micro F1-score of 0.91, outperforming other LLMs in lesion classification.

Conclusion: These findings highlight the feasibility of LLMs for text-based diagnostic support, particularly in resource-limited or high-volume clinical settings. While LLMs show promise in medical diagnostics, further validation through prospective studies is necessary to ensure reliable clinical integration. The study emphasizes the importance of rigorous benchmarking to assess model performance comprehensively.


Source journal

Computational and Structural Biotechnology Journal (Biochemistry, Genetics and Molecular Biology: Biophysics)

CiteScore: 9.30
Self-citation rate: 3.30%
Articles published: 540
Review time: 6 weeks

About the journal: Computational and Structural Biotechnology Journal (CSBJ) is an online gold open access journal publishing research articles and reviews after full peer review. All articles are published, without barriers to access, immediately upon acceptance. The journal places a strong emphasis on functional and mechanistic understanding of how molecular components in a biological process work together through the application of computational methods. Structural data may provide such insights, but they are not a prerequisite for publication in the journal. Specific areas of interest include, but are not limited to: structure and function of proteins, nucleic acids and other macromolecules; structure and function of multi-component complexes; protein folding, processing and degradation; enzymology; computational and structural studies of plant systems; microbial informatics; genomics; proteomics; metabolomics; algorithms and hypothesis in bioinformatics; mathematical and theoretical biology; computational chemistry and drug discovery; microscopy and molecular imaging; nanotechnology; systems and synthetic biology.