Sebastian Nowak, Benjamin Wulff, Yannik C Layer, Maike Theis, Alexander Isaak, Babak Salam, Wolfgang Block, Daniel Kuetting, Claus C Pieper, Julian A Luetkens, Ulrike Attenberger, Alois M Sprinkart
{"title":"在从自由文本报告中提取胸片结果方面,确保隐私的开放权重大型语言模型与封闭权重gpt - 40具有竞争力。","authors":"Sebastian Nowak, Benjamin Wulff, Yannik C Layer, Maike Theis, Alexander Isaak, Babak Salam, Wolfgang Block, Daniel Kuetting, Claus C Pieper, Julian A Luetkens, Ulrike Attenberger, Alois M Sprinkart","doi":"10.1148/radiol.240895","DOIUrl":null,"url":null,"abstract":"<p><p>Background Large-scale secondary use of clinical databases requires automated tools for retrospective extraction of structured content from free-text radiology reports. Purpose To share data and insights on the application of privacy-preserving open-weights large language models (LLMs) for reporting content extraction with comparison to standard rule-based systems and the closed-weights LLMs from OpenAI. Materials and Methods In this retrospective exploratory study conducted between May 2024 and September 2024, zero-shot prompting of 17 open-weights LLMs was preformed. These LLMs with model weights released under open licenses were compared with rule-based annotation and with OpenAI's GPT-4o, GPT-4o-mini, GPT-4-turbo, and GPT-3.5-turbo on a manually annotated public English chest radiography dataset (Indiana University, 3927 patients and reports). An annotated nonpublic German chest radiography dataset (18 500 reports, 16 844 patients [10 340 male; mean age, 62.6 years ± 21.5 {SD}]) was used to compare local fine-tuning of all open-weights LLMs via low-rank adaptation and 4-bit quantization to bidirectional encoder representations from transformers (BERT) with different subsets of reports (from 10 to 14 580). Nonoverlapping 95% CIs of macro-averaged F1 scores were defined as relevant differences. Results For the English reports, the highest zero-shot macro-averaged F1 score was observed for GPT-4o (92.4% [95% CI: 87.9, 95.9]); GPT-4o outperformed the rule-based CheXpert [Stanford University] (73.1% [95% CI: 65.1, 79.7]) but was comparable in performance to several open-weights LLMs (top three: Mistral-Large [Mistral AI], 92.6% [95% CI: 88.2, 96.0]; Llama-3.1-70b [Meta AI], 92.2% [95% CI: 87.1, 95.8]; and Llama-3.1-405b [Meta AI]: 90.3% [95% CI: 84.6, 94.5]). For the German reports, Mistral-Large (91.6% [95% CI: 90.5, 92.7]) had the highest zero-shot macro-averaged F1 score among the six other open-weights LLMs and outperformed the rule-based annotation (74.8% [95% CI: 73.3, 76.1]). Using 1000 reports for fine-tuning, all LLMs (top three: Mistral-Large, 94.3% [95% CI: 93.5, 95.2]; OpenBioLLM-70b [Saama]: 93.9% [95% CI: 92.9, 94.8]; and Mixtral-8×22b [Mistral AI]: 93.8% [95% CI: 92.8, 94.7]) achieved significantly higher macro-averaged F1 score than did BERT (86.7% [95% CI: 85.0, 88.3]); however, the differences were not relevant when 2000 or more reports were used for fine-tuning. Conclusion LLMs have the potential to outperform rule-based systems for zero-shot \"out-of-the-box\" structuring of report databases, with privacy-ensuring open-weights LLMs being competitive with closed-weights GPT-4o. Additionally, the open-weights LLM outperformed BERT when moderate numbers of reports were used for fine-tuning. Published under a CC BY 4.0 license. <i>Supplemental material is available for this article.</i> See also the editorial by Gee and Yao in this issue.</p>","PeriodicalId":20896,"journal":{"name":"Radiology","volume":"314 1","pages":"e240895"},"PeriodicalIF":12.1000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Privacy-ensuring Open-weights Large Language Models Are Competitive with Closed-weights GPT-4o in Extracting Chest Radiography Findings from Free-Text Reports.\",\"authors\":\"Sebastian Nowak, Benjamin Wulff, Yannik C Layer, Maike Theis, Alexander Isaak, Babak Salam, Wolfgang Block, Daniel Kuetting, Claus C Pieper, Julian A Luetkens, Ulrike Attenberger, Alois M Sprinkart\",\"doi\":\"10.1148/radiol.240895\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Background Large-scale secondary use of clinical databases requires automated tools for retrospective extraction of structured content from free-text radiology reports. Purpose To share data and insights on the application of privacy-preserving open-weights large language models (LLMs) for reporting content extraction with comparison to standard rule-based systems and the closed-weights LLMs from OpenAI. Materials and Methods In this retrospective exploratory study conducted between May 2024 and September 2024, zero-shot prompting of 17 open-weights LLMs was preformed. These LLMs with model weights released under open licenses were compared with rule-based annotation and with OpenAI's GPT-4o, GPT-4o-mini, GPT-4-turbo, and GPT-3.5-turbo on a manually annotated public English chest radiography dataset (Indiana University, 3927 patients and reports). An annotated nonpublic German chest radiography dataset (18 500 reports, 16 844 patients [10 340 male; mean age, 62.6 years ± 21.5 {SD}]) was used to compare local fine-tuning of all open-weights LLMs via low-rank adaptation and 4-bit quantization to bidirectional encoder representations from transformers (BERT) with different subsets of reports (from 10 to 14 580). Nonoverlapping 95% CIs of macro-averaged F1 scores were defined as relevant differences. Results For the English reports, the highest zero-shot macro-averaged F1 score was observed for GPT-4o (92.4% [95% CI: 87.9, 95.9]); GPT-4o outperformed the rule-based CheXpert [Stanford University] (73.1% [95% CI: 65.1, 79.7]) but was comparable in performance to several open-weights LLMs (top three: Mistral-Large [Mistral AI], 92.6% [95% CI: 88.2, 96.0]; Llama-3.1-70b [Meta AI], 92.2% [95% CI: 87.1, 95.8]; and Llama-3.1-405b [Meta AI]: 90.3% [95% CI: 84.6, 94.5]). For the German reports, Mistral-Large (91.6% [95% CI: 90.5, 92.7]) had the highest zero-shot macro-averaged F1 score among the six other open-weights LLMs and outperformed the rule-based annotation (74.8% [95% CI: 73.3, 76.1]). Using 1000 reports for fine-tuning, all LLMs (top three: Mistral-Large, 94.3% [95% CI: 93.5, 95.2]; OpenBioLLM-70b [Saama]: 93.9% [95% CI: 92.9, 94.8]; and Mixtral-8×22b [Mistral AI]: 93.8% [95% CI: 92.8, 94.7]) achieved significantly higher macro-averaged F1 score than did BERT (86.7% [95% CI: 85.0, 88.3]); however, the differences were not relevant when 2000 or more reports were used for fine-tuning. Conclusion LLMs have the potential to outperform rule-based systems for zero-shot \\\"out-of-the-box\\\" structuring of report databases, with privacy-ensuring open-weights LLMs being competitive with closed-weights GPT-4o. Additionally, the open-weights LLM outperformed BERT when moderate numbers of reports were used for fine-tuning. Published under a CC BY 4.0 license. <i>Supplemental material is available for this article.</i> See also the editorial by Gee and Yao in this issue.</p>\",\"PeriodicalId\":20896,\"journal\":{\"name\":\"Radiology\",\"volume\":\"314 1\",\"pages\":\"e240895\"},\"PeriodicalIF\":12.1000,\"publicationDate\":\"2025-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Radiology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1148/radiol.240895\",\"RegionNum\":1,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1148/radiol.240895","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Privacy-ensuring Open-weights Large Language Models Are Competitive with Closed-weights GPT-4o in Extracting Chest Radiography Findings from Free-Text Reports.
Background Large-scale secondary use of clinical databases requires automated tools for retrospective extraction of structured content from free-text radiology reports. Purpose To share data and insights on the application of privacy-preserving open-weights large language models (LLMs) for reporting content extraction with comparison to standard rule-based systems and the closed-weights LLMs from OpenAI. Materials and Methods In this retrospective exploratory study conducted between May 2024 and September 2024, zero-shot prompting of 17 open-weights LLMs was preformed. These LLMs with model weights released under open licenses were compared with rule-based annotation and with OpenAI's GPT-4o, GPT-4o-mini, GPT-4-turbo, and GPT-3.5-turbo on a manually annotated public English chest radiography dataset (Indiana University, 3927 patients and reports). An annotated nonpublic German chest radiography dataset (18 500 reports, 16 844 patients [10 340 male; mean age, 62.6 years ± 21.5 {SD}]) was used to compare local fine-tuning of all open-weights LLMs via low-rank adaptation and 4-bit quantization to bidirectional encoder representations from transformers (BERT) with different subsets of reports (from 10 to 14 580). Nonoverlapping 95% CIs of macro-averaged F1 scores were defined as relevant differences. Results For the English reports, the highest zero-shot macro-averaged F1 score was observed for GPT-4o (92.4% [95% CI: 87.9, 95.9]); GPT-4o outperformed the rule-based CheXpert [Stanford University] (73.1% [95% CI: 65.1, 79.7]) but was comparable in performance to several open-weights LLMs (top three: Mistral-Large [Mistral AI], 92.6% [95% CI: 88.2, 96.0]; Llama-3.1-70b [Meta AI], 92.2% [95% CI: 87.1, 95.8]; and Llama-3.1-405b [Meta AI]: 90.3% [95% CI: 84.6, 94.5]). For the German reports, Mistral-Large (91.6% [95% CI: 90.5, 92.7]) had the highest zero-shot macro-averaged F1 score among the six other open-weights LLMs and outperformed the rule-based annotation (74.8% [95% CI: 73.3, 76.1]). Using 1000 reports for fine-tuning, all LLMs (top three: Mistral-Large, 94.3% [95% CI: 93.5, 95.2]; OpenBioLLM-70b [Saama]: 93.9% [95% CI: 92.9, 94.8]; and Mixtral-8×22b [Mistral AI]: 93.8% [95% CI: 92.8, 94.7]) achieved significantly higher macro-averaged F1 score than did BERT (86.7% [95% CI: 85.0, 88.3]); however, the differences were not relevant when 2000 or more reports were used for fine-tuning. Conclusion LLMs have the potential to outperform rule-based systems for zero-shot "out-of-the-box" structuring of report databases, with privacy-ensuring open-weights LLMs being competitive with closed-weights GPT-4o. Additionally, the open-weights LLM outperformed BERT when moderate numbers of reports were used for fine-tuning. Published under a CC BY 4.0 license. Supplemental material is available for this article. See also the editorial by Gee and Yao in this issue.
期刊介绍:
Published regularly since 1923 by the Radiological Society of North America (RSNA), Radiology has long been recognized as the authoritative reference for the most current, clinically relevant and highest quality research in the field of radiology. Each month the journal publishes approximately 240 pages of peer-reviewed original research, authoritative reviews, well-balanced commentary on significant articles, and expert opinion on new techniques and technologies.
Radiology publishes cutting edge and impactful imaging research articles in radiology and medical imaging in order to help improve human health.