Reza Dehdab MD , Fiona Mankertz MD , Jan Michael Brendel MD , Nour Maalouf MD , Kenan Kaya MD , Saif Afat MD , Shadi Kolahdoozan MD, MPH, PhD , Amir Reza Radmard MD
{"title":"基于llm的影像学特征提取:克罗恩病的疾病活动自动评分。","authors":"Reza Dehdab MD , Fiona Mankertz MD , Jan Michael Brendel MD , Nour Maalouf MD , Kenan Kaya MD , Saif Afat MD , Shadi Kolahdoozan MD, MPH, PhD , Amir Reza Radmard MD","doi":"10.1016/j.acra.2025.07.041","DOIUrl":null,"url":null,"abstract":"<div><h3>Rationale and Objectives</h3><div>Large Language Models (LLMs) offer a promising solution for extracting structured clinical information from free-text radiology reports. The Simplified Magnetic Resonance Index of Activity (sMARIA) is a validated scoring system used to quantify Crohn’s disease (CD) activity based on Magnetic Resonance Enterography (MRE) findings. This study aims to evaluate the performance of two advanced LLMs in extracting key imaging features and computing sMARIA scores from free-text MRE reports.</div></div><div><h3>Materials and Methods</h3><div>This retrospective study included 117 anonymized free-text MRE reports from patients with confirmed CD. ChatGPT (GPT-4o) and DeepSeek (DeepSeek-R1) were prompted using a structured input designed to extract four key radiologic features relevant to sMARIA: bowel wall thickness, mural edema, perienteric fat stranding, and ulceration. LLM outputs were evaluated against radiologist annotations at both the segment and feature levels. Segment-level agreement was assessed using accuracy, mean absolute error (MAE) and Pearson correlation. Feature-level performance was evaluated using sensitivity, specificity, precision, and F1-score. Errors including confabulations were recorded descriptively<em>.</em></div></div><div><h3>Results</h3><div>ChatGPT achieved a segment-level accuracy of 98.6%, MAE of 0.17, and Pearson correlation of 0.99. DeepSeek achieved 97.3% accuracy, MAE of 0.51, and correlation of 0.96. At the feature level, ChatGPT yielded an F1-score of 98.8% (precision 97.8%, sensitivity 99.9%), while DeepSeek achieved 97.9% (precision 96.0%, sensitivity 99.8%).</div></div><div><h3>Conclusions</h3><div>LLMs demonstrate near-human accuracy in extracting structured information and computing sMARIA scores from free-text MRE reports. This enables automated assessment of CD activity without altering current reporting workflows, supporting longitudinal monitoring and large-scale research. Integration into clinical decision support systems may be feasible in the future, provided appropriate human oversight and validation are ensured.</div></div>","PeriodicalId":50928,"journal":{"name":"Academic Radiology","volume":"32 10","pages":"Pages 5869-5877"},"PeriodicalIF":3.9000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"LLM-Based Extraction of Imaging Features from Radiology Reports: Automating Disease Activity Scoring in Crohn’s Disease\",\"authors\":\"Reza Dehdab MD , Fiona Mankertz MD , Jan Michael Brendel MD , Nour Maalouf MD , Kenan Kaya MD , Saif Afat MD , Shadi Kolahdoozan MD, MPH, PhD , Amir Reza Radmard MD\",\"doi\":\"10.1016/j.acra.2025.07.041\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Rationale and Objectives</h3><div>Large Language Models (LLMs) offer a promising solution for extracting structured clinical information from free-text radiology reports. The Simplified Magnetic Resonance Index of Activity (sMARIA) is a validated scoring system used to quantify Crohn’s disease (CD) activity based on Magnetic Resonance Enterography (MRE) findings. This study aims to evaluate the performance of two advanced LLMs in extracting key imaging features and computing sMARIA scores from free-text MRE reports.</div></div><div><h3>Materials and Methods</h3><div>This retrospective study included 117 anonymized free-text MRE reports from patients with confirmed CD. ChatGPT (GPT-4o) and DeepSeek (DeepSeek-R1) were prompted using a structured input designed to extract four key radiologic features relevant to sMARIA: bowel wall thickness, mural edema, perienteric fat stranding, and ulceration. LLM outputs were evaluated against radiologist annotations at both the segment and feature levels. Segment-level agreement was assessed using accuracy, mean absolute error (MAE) and Pearson correlation. Feature-level performance was evaluated using sensitivity, specificity, precision, and F1-score. Errors including confabulations were recorded descriptively<em>.</em></div></div><div><h3>Results</h3><div>ChatGPT achieved a segment-level accuracy of 98.6%, MAE of 0.17, and Pearson correlation of 0.99. DeepSeek achieved 97.3% accuracy, MAE of 0.51, and correlation of 0.96. At the feature level, ChatGPT yielded an F1-score of 98.8% (precision 97.8%, sensitivity 99.9%), while DeepSeek achieved 97.9% (precision 96.0%, sensitivity 99.8%).</div></div><div><h3>Conclusions</h3><div>LLMs demonstrate near-human accuracy in extracting structured information and computing sMARIA scores from free-text MRE reports. This enables automated assessment of CD activity without altering current reporting workflows, supporting longitudinal monitoring and large-scale research. Integration into clinical decision support systems may be feasible in the future, provided appropriate human oversight and validation are ensured.</div></div>\",\"PeriodicalId\":50928,\"journal\":{\"name\":\"Academic Radiology\",\"volume\":\"32 10\",\"pages\":\"Pages 5869-5877\"},\"PeriodicalIF\":3.9000,\"publicationDate\":\"2025-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Academic Radiology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1076633225007111\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Academic Radiology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1076633225007111","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
LLM-Based Extraction of Imaging Features from Radiology Reports: Automating Disease Activity Scoring in Crohn’s Disease
Rationale and Objectives
Large Language Models (LLMs) offer a promising solution for extracting structured clinical information from free-text radiology reports. The Simplified Magnetic Resonance Index of Activity (sMARIA) is a validated scoring system used to quantify Crohn’s disease (CD) activity based on Magnetic Resonance Enterography (MRE) findings. This study aims to evaluate the performance of two advanced LLMs in extracting key imaging features and computing sMARIA scores from free-text MRE reports.
Materials and Methods
This retrospective study included 117 anonymized free-text MRE reports from patients with confirmed CD. ChatGPT (GPT-4o) and DeepSeek (DeepSeek-R1) were prompted using a structured input designed to extract four key radiologic features relevant to sMARIA: bowel wall thickness, mural edema, perienteric fat stranding, and ulceration. LLM outputs were evaluated against radiologist annotations at both the segment and feature levels. Segment-level agreement was assessed using accuracy, mean absolute error (MAE) and Pearson correlation. Feature-level performance was evaluated using sensitivity, specificity, precision, and F1-score. Errors including confabulations were recorded descriptively.
Results
ChatGPT achieved a segment-level accuracy of 98.6%, MAE of 0.17, and Pearson correlation of 0.99. DeepSeek achieved 97.3% accuracy, MAE of 0.51, and correlation of 0.96. At the feature level, ChatGPT yielded an F1-score of 98.8% (precision 97.8%, sensitivity 99.9%), while DeepSeek achieved 97.9% (precision 96.0%, sensitivity 99.8%).
Conclusions
LLMs demonstrate near-human accuracy in extracting structured information and computing sMARIA scores from free-text MRE reports. This enables automated assessment of CD activity without altering current reporting workflows, supporting longitudinal monitoring and large-scale research. Integration into clinical decision support systems may be feasible in the future, provided appropriate human oversight and validation are ensured.
期刊介绍:
Academic Radiology publishes original reports of clinical and laboratory investigations in diagnostic imaging, the diagnostic use of radioactive isotopes, computed tomography, positron emission tomography, magnetic resonance imaging, ultrasound, digital subtraction angiography, image-guided interventions and related techniques. It also includes brief technical reports describing original observations, techniques, and instrumental developments; state-of-the-art reports on clinical issues, new technology and other topics of current medical importance; meta-analyses; scientific studies and opinions on radiologic education; and letters to the Editor.