LLM-Based Extraction of Imaging Features from Radiology Reports: Automating Disease Activity Scoring in Crohn’s Disease

IF 3.9 | Zone 2 (Medicine) | Q1 RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING

Academic Radiology | Pub Date: 2025-10-01 | DOI: 10.1016/j.acra.2025.07.041

Reza Dehdab MD, Fiona Mankertz MD, Jan Michael Brendel MD, Nour Maalouf MD, Kenan Kaya MD, Saif Afat MD, Shadi Kolahdoozan MD, MPH, PhD, Amir Reza Radmard MD

Academic Radiology, Volume 32, Issue 10, Pages 5869-5877. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S1076633225007111

Citations: 0

Abstract

LLM-Based Extraction of Imaging Features from Radiology Reports: Automating Disease Activity Scoring in Crohn’s Disease

Rationale and Objectives

Large Language Models (LLMs) offer a promising solution for extracting structured clinical information from free-text radiology reports. The Simplified Magnetic Resonance Index of Activity (sMARIA) is a validated scoring system used to quantify Crohn’s disease (CD) activity based on Magnetic Resonance Enterography (MRE) findings. This study aims to evaluate the performance of two advanced LLMs in extracting key imaging features and computing sMARIA scores from free-text MRE reports.

Materials and Methods

This retrospective study included 117 anonymized free-text MRE reports from patients with confirmed CD. ChatGPT (GPT-4o) and DeepSeek (DeepSeek-R1) were prompted using a structured input designed to extract four key radiologic features relevant to sMARIA: bowel wall thickness, mural edema, perienteric fat stranding, and ulceration. LLM outputs were evaluated against radiologist annotations at both the segment and feature levels. Segment-level agreement was assessed using accuracy, mean absolute error (MAE) and Pearson correlation. Feature-level performance was evaluated using sensitivity, specificity, precision, and F1-score. Errors including confabulations were recorded descriptively.
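The scoring and segment-level evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not state the sMARIA weights, so the code assumes the commonly published definition (1.5 for wall thickness above 3 mm, 1 for edema, 1 for fat stranding, 2 for ulceration). The `SegmentFindings` container, the function names, and the toy scores are hypothetical, and "accuracy" is assumed here to mean an exact match of the segment score.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class SegmentFindings:
    """Hypothetical container for the four features extracted per bowel segment."""
    wall_thickness_mm: float
    mural_edema: bool
    fat_stranding: bool
    ulceration: bool


def smaria(seg: SegmentFindings) -> float:
    """Segment sMARIA score using the assumed published weights (not given in the abstract)."""
    return (
        1.5 * (seg.wall_thickness_mm > 3.0)  # thickened wall, assumed cutoff of 3 mm
        + 1.0 * seg.mural_edema
        + 1.0 * seg.fat_stranding
        + 2.0 * seg.ulceration
    )


def segment_agreement(pred: list[float], ref: list[float]) -> tuple[float, float, float]:
    """Return (accuracy, MAE, Pearson r) between model scores and reference scores."""
    n = len(ref)
    accuracy = sum(p == r for p, r in zip(pred, ref)) / n  # exact-match accuracy (assumption)
    mae = sum(abs(p - r) for p, r in zip(pred, ref)) / n
    mean_p, mean_r = sum(pred) / n, sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    pearson = cov / sqrt(var_p * var_r)
    return accuracy, mae, pearson


# Toy, made-up scores for four segments (illustration only, not study data).
llm_scores = [0.0, 1.5, 5.5, 2.5]
radiologist_scores = [0.0, 1.5, 5.5, 3.5]
accuracy, mae, pearson = segment_agreement(llm_scores, radiologist_scores)
print(f"accuracy={accuracy:.2f}, MAE={mae:.2f}, r={pearson:.3f}")
```

Keeping the score computation separate from feature extraction means the deterministic arithmetic can be checked independently of whatever the LLM returns.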

Results

ChatGPT achieved a segment-level accuracy of 98.6%, MAE of 0.17, and Pearson correlation of 0.99. DeepSeek achieved 97.3% accuracy, MAE of 0.51, and correlation of 0.96. At the feature level, ChatGPT yielded an F1-score of 98.8% (precision 97.8%, sensitivity 99.9%), while DeepSeek achieved 97.9% (precision 96.0%, sensitivity 99.8%).
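As a quick consistency check, the reported F1-scores can be recomputed from the reported precision and sensitivity, since F1 is the harmonic mean of the two. The sketch below uses only the figures quoted above; the function name `f1_score` is an illustrative choice, and rounding in the published inputs means the agreement is approximate.

```python
def f1_score(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# (precision, sensitivity) as reported in the Results paragraph.
reported = {
    "ChatGPT": (0.978, 0.999),
    "DeepSeek": (0.960, 0.998),
}

for model, (precision, sensitivity) in reported.items():
    print(f"{model}: F1 = {f1_score(precision, sensitivity):.1%}")
# ChatGPT: F1 = 98.8%
# DeepSeek: F1 = 97.9%
```

Both recomputed values match the reported F1-scores to one decimal place.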

Conclusions

LLMs demonstrate near-human accuracy in extracting structured information and computing sMARIA scores from free-text MRE reports. This enables automated assessment of CD activity without altering current reporting workflows, supporting longitudinal monitoring and large-scale research. Integration into clinical decision support systems may be feasible in the future, provided appropriate human oversight and validation are ensured.

Source journal

Academic Radiology (Medicine: Nuclear Medicine)

CiteScore: 7.60

Self-citation rate: 10.40%

Articles published: 432

Review time: 18 days

Journal description: Academic Radiology publishes original reports of clinical and laboratory investigations in diagnostic imaging, the diagnostic use of radioactive isotopes, computed tomography, positron emission tomography, magnetic resonance imaging, ultrasound, digital subtraction angiography, image-guided interventions and related techniques. It also includes brief technical reports describing original observations, techniques, and instrumental developments; state-of-the-art reports on clinical issues, new technology and other topics of current medical importance; meta-analyses; scientific studies and opinions on radiologic education; and letters to the Editor.