Structured Transformation of Unstructured Prostate MRI Reports Using Large Language Models.

IF 2.2 | Zone 4, Medicine | Q2 | RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING
Luca Di Palma, Fatemeh Darvizeh, Marco Alì, Deborah Fazzini
Journal: Tomography, 11(6). Published: 2025-06-17. Publication type: Journal Article.
DOI: 10.3390/tomography11060069 (https://doi.org/10.3390/tomography11060069)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12196861/pdf/

Abstract

Objectives: To assess the ability of high-performing open-weight large language models (LLMs) to extract key radiological features from prostate MRI reports.

Methods: Five LLMs (Llama3.3, DeepSeek-R1-Llama3.3, Phi4, Gemma-2, and Qwen2.5-14B) were used to analyze free-text MRI reports retrieved from clinical practice. Each LLM processed reports three times using specialized prompts to extract (1) dimensions, (2) volume and PSA density, and (3) lesion characteristics. An experienced radiologist manually annotated the dataset, defining entities (Exam) and sub-entities (Lesion, Dimension). Feature- and physician-level performance were then assessed.
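
The entity layout and the three-prompt extraction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes one pass per prompt, and the field names (`axis`, `value_mm`, `location`, `size_mm`, `volume_ml`), the units, the prompt wording, and the `stub_model` stand-in are all hypothetical. A real setup would replace `stub_model` with a call to a locally served open-weight model.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Dimension:                      # sub-entity; fields are hypothetical
    axis: str
    value_mm: float


@dataclass
class Lesion:                         # sub-entity; fields are hypothetical
    location: str
    size_mm: Optional[float] = None


@dataclass
class Exam:                           # top-level entity
    dimensions: List[Dimension] = field(default_factory=list)
    volume_ml: Optional[float] = None
    psa_density: Optional[float] = None
    lesions: List[Lesion] = field(default_factory=list)


# One specialized prompt per pass; the wording is illustrative only.
PROMPTS = {
    "dimensions": 'Return JSON {"dimensions": [{"axis": str, "value_mm": number}]}.\nReport: ',
    "volume_psad": 'Return JSON {"volume_ml": number|null, "psa_density": number|null}.\nReport: ',
    "lesions": 'Return JSON {"lesions": [{"location": str, "size_mm": number|null}]}.\nReport: ',
}


def extract_exam(report: str, ask: Callable[[str], str]) -> Exam:
    """Run the three passes over one report and merge the JSON answers."""
    merged: dict = {}
    for prompt in PROMPTS.values():
        merged.update(json.loads(ask(prompt + report)))
    return Exam(
        dimensions=[Dimension(**d) for d in merged.get("dimensions", [])],
        volume_ml=merged.get("volume_ml"),
        psa_density=merged.get("psa_density"),
        lesions=[Lesion(**l) for l in merged.get("lesions", [])],
    )


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned JSON for each pass."""
    if prompt.startswith(PROMPTS["dimensions"]):
        return '{"dimensions": [{"axis": "transverse", "value_mm": 45.0}]}'
    if prompt.startswith(PROMPTS["volume_psad"]):
        return '{"volume_ml": 40.0, "psa_density": 0.12}'
    return '{"lesions": [{"location": "left peripheral zone", "size_mm": 9.0}]}'


exam = extract_exam("(free-text report goes here)", stub_model)
print(exam)
```

Splitting the task into narrow prompts keeps each expected JSON answer small, which makes malformed output easier to detect before merging.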

Results: The LLMs analyzed 250 MRI exams reported by 7 radiologists. At the feature level, DeepSeek-R1-Llama3.3 achieved the highest average score (98.6% ± 2.1%), followed by Phi4 (98.1% ± 2.2%), Llama3.3 (98.0% ± 3.0%), Qwen2.5 (97.5% ± 3.9%), and Gemma-2 (96.0% ± 3.4%). All models excelled at extracting PSA density (100%) and volume (≥98.4%), while lesion extraction showed greater variability (88.4–94.0%). Performance also varied by reporting radiologist: Physician B's reports yielded the highest mean score (99.9% ± 0.2%) and Physician C's the lowest (94.4% ± 2.3%).
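
As a rough illustration of how a "mean ± SD" feature-level score could be computed, the sketch below scores each feature by exact match against the radiologist's annotation and then averages across features. The abstract does not state the scoring rule or which standard deviation was used, so exact matching and the sample standard deviation are assumptions here, and the input values are made up for the example.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def feature_accuracy(predicted: Sequence, annotated: Sequence) -> float:
    """Percentage of exams where the extracted value equals the annotation."""
    if len(predicted) != len(annotated) or not annotated:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == a for p, a in zip(predicted, annotated))
    return 100.0 * hits / len(annotated)


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-feature scores."""
    return mean(scores), stdev(scores)


# Made-up values for illustration; these are not the study's data.
volume_acc = feature_accuracy([40.0, 35.5, 60.0, 28.0], [40.0, 35.5, 60.0, 28.0])
lesion_acc = feature_accuracy(["PZ", "TZ", "PZ", "TZ"], ["PZ", "TZ", "TZ", "TZ"])
avg, sd = summarize([volume_acc, lesion_acc])
print(f"{avg:.1f}% ± {sd:.1f}%")
```

Grouping the same per-exam matches by reporting radiologist instead of by feature would give the physician-level view.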

Conclusions: LLMs showed promising results in automated feature extraction from radiology reports, with DeepSeek-R1-Llama3.3 achieving the highest overall score. These models can improve clinical workflows by converting free-text medical reports into structured data. However, a preliminary analysis of reporting styles is needed to identify potential challenges and to tailor prompt design to individual physicians. Such tailoring can further enhance the robustness and adaptability of LLM-driven clinical data extraction.

Source journal: Tomography (Medicine: Radiology, Nuclear Medicine and Imaging)
CiteScore: 2.70
Self-citation rate: 10.50%
Articles published: 222
Journal description: Tomography publishes basic (technical and pre-clinical) and clinical scientific articles that advance imaging technologies. Tomography encompasses studies that use single or multiple imaging modalities, including for example CT, US, PET, SPECT, MR and hyperpolarization technologies, as well as optical modalities (i.e. bioluminescence, photoacoustic, endomicroscopy, fiber optic imaging and optical computed tomography), in basic sciences, engineering, preclinical and clinical medicine. Tomography also welcomes studies involving exploration and refinement of contrast mechanisms and image-derived metrics within and across modalities toward the development of novel imaging probes for image-based feedback and intervention. The use of imaging in biology and medicine provides unparalleled opportunities to noninvasively interrogate tissues to obtain the real-time dynamic and quantitative information required for diagnosis and response to interventions, and to follow evolving pathological conditions. As multi-modal studies and the complexities of imaging technologies themselves keep increasing, Tomography provides a unique publication venue that allows investigators to communicate more precisely integrated findings related to the diverse and heterogeneous features associated with the underlying anatomical, physiological, functional, metabolic and molecular genetic activities of normal and diseased tissue. Thus Tomography publishes peer-reviewed articles that involve the broad use of imaging of any tissue and disease type, including both preclinical and clinical investigations. In addition, hardware and software advances, along with chemical and molecular probe advances, are welcome when they are deemed to contribute significantly to the long-term goal of improving the overall impact of imaging on scientific and clinical discovery.