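Open-Weight Language Models and Retrieval-Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports: Assessment of Approaches and Parameters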

Impact factor: 13.2; JCR Q1; Computer Science, Artificial Intelligence

Radiology-Artificial Intelligence; Pub Date: 2025-05-01; DOI: 10.1148/ryai.240551

Mohamed Sobhi Jabal, Pranav Warman, Jikai Zhang, Kartikeye Gupta, Ayush Jain, Maciej Mazurowski, Walter Wiggins, Kirti Magudia, Evan Calabrese


Citations: 0

Abstract

"Just Accepted" papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered that could affect the content.

Open-Weight Language Models and Retrieval-Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports: Assessment of Approaches and Parameters.

Purpose: To develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports using open-weight language models (LMs) and retrieval-augmented generation (RAG) and to assess the effects of model configuration variables on extraction performance.

Materials and Methods: This retrospective study used two datasets: 7294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2154 pathology reports annotated for IDH mutation status (January 2017-July 2021). An automated pipeline was developed to benchmark the performance of various LMs and RAG configurations for accuracy of structured data extraction from reports. The effect of model size, quantization, prompting strategies, output formatting, and inference parameters on model accuracy was systematically evaluated.

Results: The best-performing models achieved up to 98% accuracy in extracting BT-RADS scores from radiology reports and greater than 90% accuracy for extraction of IDH mutation status from pathology reports. The best model was medical fine-tuned Llama 3. Larger, newer, and domain fine-tuned models consistently outperformed older and smaller models (mean accuracy, 86% vs 75%; P < .001). Model quantization had minimal effect on performance. Few-shot prompting significantly improved accuracy (mean [±SD] increase, 32% ± 32; P = .02). RAG improved performance for complex pathology reports by a mean of 48% ± 11 (P = .001) but not for shorter radiology reports (-8% ± 31; P = .39).

Conclusion: This study demonstrates the potential of open LMs in automated extraction of structured clinical data from unstructured clinical reports with local privacy-preserving application. Careful model selection, prompt engineering, and semiautomated optimization using annotated data are critical for optimal performance.

Keywords: Large Language Models, Retrieval-Augmented Generation, Radiology, Pathology, Health Care Reports

Supplemental material is available for this article.

© RSNA, 2025

See also commentary by Tejani and Rauschecker in this issue.
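As a rough illustration of the kind of benchmark the abstract describes (few-shot prompting, parsing a structured label out of model output, and scoring accuracy against annotated reports), a minimal Python sketch follows. Everything in it is an assumption made for illustration and is not taken from the study's pipeline: the function names, the prompt wording, the label pattern, the toy report snippets, and the `fake_model` stand-in are all hypothetical. A real setup would replace `fake_model` with a call to a locally hosted open-weight model.

```python
import re
from typing import Callable, Optional

# Assumed label format (illustrative only): a digit 0-4 with an optional a/b/c suffix,
# preceded by the string "BT-RADS" in the model output.
SCORE_PATTERN = re.compile(
    r"BT-RADS\s*(?:score)?\s*[:=]?\s*([0-4][abc]?)", re.IGNORECASE
)


def build_few_shot_prompt(report: str, examples: list[tuple[str, str]]) -> str:
    """Prepend worked (report, label) pairs to the report that needs a label."""
    parts = ["Extract the BT-RADS score from the radiology report."]
    for example_report, label in examples:
        parts.append(f"Report: {example_report}\nBT-RADS: {label}")
    parts.append(f"Report: {report}\nBT-RADS:")
    return "\n\n".join(parts)


def parse_score(model_output: str) -> Optional[str]:
    """Return a lowercased score found in free-text output, or None if absent."""
    match = SCORE_PATTERN.search(model_output)
    return match.group(1).lower() if match else None


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a language model; uses trivial keyword rules."""
    target = prompt.rsplit("Report:", 1)[-1].lower()
    if "improved" in target:
        return "BT-RADS: 1a"
    if "worse" in target:
        return "The findings are most consistent with BT-RADS 4."
    return "I cannot determine a score."


def accuracy(
    model: Callable[[str], str],
    dataset: list[tuple[str, str]],
    examples: list[tuple[str, str]],
) -> float:
    """Fraction of annotated reports whose parsed prediction equals the gold label."""
    correct = 0
    for report, gold in dataset:
        predicted = parse_score(model(build_few_shot_prompt(report, examples)))
        correct += predicted == gold
    return correct / len(dataset)


# Invented toy data for the sketch; not real reports and not clinical guidance.
FEW_SHOT = [("Enhancing lesion has decreased in size.", "1a")]
ANNOTATED = [
    ("Imaging has improved since the prior study.", "1a"),
    ("Enhancement is worse, with new mass effect.", "4"),
    ("Stable postoperative changes.", "2"),
]

if __name__ == "__main__":
    print(f"accuracy = {accuracy(fake_model, ANNOTATED, FEW_SHOT):.2f}")
```

Keeping the model behind a plain callable means the same scoring loop could be rerun while varying one configuration variable at a time (for example the number of few-shot examples, or whether retrieved context is added to the prompt), which is the general shape of comparison the abstract reports.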


Source journal

Radiology-Artificial Intelligence

CiteScore: 16.20

Self-citation rate: 1.00%

About the journal: Radiology: Artificial Intelligence is a bi-monthly publication that focuses on the emerging applications of machine learning and artificial intelligence in the field of imaging across various disciplines. This journal is available online and accepts multiple manuscript types, including Original Research, Technical Developments, Data Resources, Review articles, Editorials, Letters to the Editor and Replies, Special Reports, and AI in Brief.