Out-of-the-Box Large Language Models for Detecting and Classifying Critical Findings in Radiology Reports Using Various Prompt Strategies.

IF 6.1 · JCR Q1 (Radiology, Nuclear Medicine & Medical Imaging) · CAS Tier 2 (Medicine)
Ish A Talati, Juan M Zambrano Chaves, Avisha Das, Imon Banerjee, Daniel L Rubin
{"title":"使用各种提示策略检测和分类放射学报告中的关键发现的开箱即用的大型语言模型。","authors":"Ish A Talati, Juan M Zambrano Chaves, Avisha Das, Imon Banerjee, Daniel L Rubin","doi":"10.2214/AJR.25.33469","DOIUrl":null,"url":null,"abstract":"<p><p><b>Background:</b> The increasing complexity and volume of radiology reports present challenges for timely critical findings communication. <b>Purpose:</b> To evaluate the performance of two out-of-the-box LLMs in detecting and classifying critical findings in radiology reports using various prompt strategies. <b>Methods:</b> The analysis included 252 radiology reports of varying modalities and anatomic regions extracted from the MIMIC-III database, divided into a prompt engineering tuning set of 50 reports, a holdout test set of 125 reports, and a pool of 77 remaining reports used as examples for few-shot prompting. An external test set of 180 chest radiography reports was extracted from the CheXpert Plus database. Reports were manually reviewed to identify critical findings and classify such findings into one of three categories (true critical finding, known/expected critical finding, equivocal critical finding). Following prompt engineering using various prompt strategies, a final prompt for optimal true critical findings detection was selected. Two general-purpose LLMs, GPT-4 and Mistral-7B, processed reports in the test sets using the final prompt. Evaluation included automated text similarity metrics (BLEU-1, ROUGE-F1, G-Eval) and manual performance metrics (precision, recall). <b>Results:</b> For true critical findings, zero-shot, few-shot static (five examples), and few-shot dynamic (five examples) prompting yielded BLEU-1 of 0.691, 0.778, and 0.748; ROUGE-F1 of 0.706, 0.797, and 0.773; and G-Eval of 0.428, 0.573, and 0.516. Precision and recall for true critical findings, known/expected critical findings, and equivocal critical findings, in the holdout test set for GPT-4 were 90.1% and 86.9%, 80.9% and 85.0%, and 80.5% and 94.3%; in the holdout test set for Mistral-7B were 75.6% and 77.4%, 34.1% and 70.0%, and 41.3% and 74.3%; in the external test set for GPT-4 were 82.6% and 98.3%, 76.9% and 71.4%, and 70.8% and 85.0%; and in the external test set for Mistral-7B were 75.0% and 93.1%, 33.3% and 92.9%, and 34.0% and 80.0%. <b>Conclusion:</b> Out-of-the-box LLMs were used to detect and classify arbitrary numbers of critical findings in radiology reports. The optimal model for true critical findings entailed a few-shot static approach. <b>Clinical Impact:</b> The study shows a role of contemporary general-purpose models in adapting to specialized medical tasks using minimal data annotation.</p>","PeriodicalId":55529,"journal":{"name":"American Journal of Roentgenology","volume":" ","pages":""},"PeriodicalIF":6.1000,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Out-of-the-Box Large Language Models for Detecting and Classifying Critical Findings in Radiology Reports Using Various Prompt Strategies.\",\"authors\":\"Ish A Talati, Juan M Zambrano Chaves, Avisha Das, Imon Banerjee, Daniel L Rubin\",\"doi\":\"10.2214/AJR.25.33469\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><b>Background:</b> The increasing complexity and volume of radiology reports present challenges for timely critical findings communication. 
<b>Purpose:</b> To evaluate the performance of two out-of-the-box LLMs in detecting and classifying critical findings in radiology reports using various prompt strategies. <b>Methods:</b> The analysis included 252 radiology reports of varying modalities and anatomic regions extracted from the MIMIC-III database, divided into a prompt engineering tuning set of 50 reports, a holdout test set of 125 reports, and a pool of 77 remaining reports used as examples for few-shot prompting. An external test set of 180 chest radiography reports was extracted from the CheXpert Plus database. Reports were manually reviewed to identify critical findings and classify such findings into one of three categories (true critical finding, known/expected critical finding, equivocal critical finding). Following prompt engineering using various prompt strategies, a final prompt for optimal true critical findings detection was selected. Two general-purpose LLMs, GPT-4 and Mistral-7B, processed reports in the test sets using the final prompt. Evaluation included automated text similarity metrics (BLEU-1, ROUGE-F1, G-Eval) and manual performance metrics (precision, recall). <b>Results:</b> For true critical findings, zero-shot, few-shot static (five examples), and few-shot dynamic (five examples) prompting yielded BLEU-1 of 0.691, 0.778, and 0.748; ROUGE-F1 of 0.706, 0.797, and 0.773; and G-Eval of 0.428, 0.573, and 0.516. Precision and recall for true critical findings, known/expected critical findings, and equivocal critical findings, in the holdout test set for GPT-4 were 90.1% and 86.9%, 80.9% and 85.0%, and 80.5% and 94.3%; in the holdout test set for Mistral-7B were 75.6% and 77.4%, 34.1% and 70.0%, and 41.3% and 74.3%; in the external test set for GPT-4 were 82.6% and 98.3%, 76.9% and 71.4%, and 70.8% and 85.0%; and in the external test set for Mistral-7B were 75.0% and 93.1%, 33.3% and 92.9%, and 34.0% and 80.0%. <b>Conclusion:</b> Out-of-the-box LLMs were used to detect and classify arbitrary numbers of critical findings in radiology reports. The optimal model for true critical findings entailed a few-shot static approach. <b>Clinical Impact:</b> The study shows a role of contemporary general-purpose models in adapting to specialized medical tasks using minimal data annotation.</p>\",\"PeriodicalId\":55529,\"journal\":{\"name\":\"American Journal of Roentgenology\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":6.1000,\"publicationDate\":\"2025-09-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"American Journal of Roentgenology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.2214/AJR.25.33469\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"American Journal of Roentgenology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2214/AJR.25.33469","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Citations: 0

Abstract


Background: The increasing complexity and volume of radiology reports present challenges for timely communication of critical findings.
Purpose: To evaluate the performance of two out-of-the-box LLMs in detecting and classifying critical findings in radiology reports using various prompt strategies.
Methods: The analysis included 252 radiology reports of varying modalities and anatomic regions extracted from the MIMIC-III database, divided into a prompt engineering tuning set of 50 reports, a holdout test set of 125 reports, and a pool of 77 remaining reports used as examples for few-shot prompting. An external test set of 180 chest radiography reports was extracted from the CheXpert Plus database. Reports were manually reviewed to identify critical findings and to classify each finding into one of three categories (true critical finding, known/expected critical finding, equivocal critical finding). After prompt engineering with various prompt strategies, a final prompt optimized for true critical findings detection was selected. Two general-purpose LLMs, GPT-4 and Mistral-7B, processed the test-set reports using the final prompt. Evaluation included automated text similarity metrics (BLEU-1, ROUGE-F1, G-Eval) and manual performance metrics (precision, recall).
Results: For true critical findings, zero-shot, few-shot static (five examples), and few-shot dynamic (five examples) prompting yielded BLEU-1 of 0.691, 0.778, and 0.748; ROUGE-F1 of 0.706, 0.797, and 0.773; and G-Eval of 0.428, 0.573, and 0.516, respectively. Precision and recall for true critical findings, known/expected critical findings, and equivocal critical findings were 90.1% and 86.9%, 80.9% and 85.0%, and 80.5% and 94.3% in the holdout test set for GPT-4; 75.6% and 77.4%, 34.1% and 70.0%, and 41.3% and 74.3% in the holdout test set for Mistral-7B; 82.6% and 98.3%, 76.9% and 71.4%, and 70.8% and 85.0% in the external test set for GPT-4; and 75.0% and 93.1%, 33.3% and 92.9%, and 34.0% and 80.0% in the external test set for Mistral-7B.
Conclusion: Out-of-the-box LLMs detected and classified arbitrary numbers of critical findings in radiology reports. The best-performing prompt for true critical findings detection used a few-shot static approach.
Clinical Impact: The study demonstrates the ability of contemporary general-purpose models to adapt to specialized medical tasks with minimal data annotation.
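The abstract does not reproduce the study's actual prompts. As a rough illustration of the few-shot static strategy described in the Methods, the sketch below builds a fixed-example chat prompt and sends a report to an OpenAI-style chat-completions endpoint. The system instruction, example reports, labels, and the `classify_report` helper are all hypothetical, not the authors' implementation.

```python
# Minimal sketch of few-shot *static* prompting for critical-findings
# detection/classification, assuming the OpenAI Python client (openai>=1.0).
# All prompt text and examples below are invented for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a radiology assistant. List every critical finding in the "
    "report and label each as one of: true critical finding, "
    "known/expected critical finding, or equivocal critical finding. "
    "Output one '<label>: <finding>' per line, or 'none' if there are none."
)

# Static few-shot examples: the same fixed set is prepended to every query.
# The study used five examples drawn from its 77-report pool; two invented
# ones are shown here.
FEW_SHOT_EXAMPLES = [
    ("CT abdomen: new free intraperitoneal air beneath the diaphragm.",
     "true critical finding: pneumoperitoneum"),
    ("Chest radiograph: small left pneumothorax, stable from prior exam.",
     "known/expected critical finding: stable left pneumothorax"),
]

def classify_report(report_text: str, model: str = "gpt-4") -> str:
    """Return the model's labeled critical findings for one report."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example_report, example_answer in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example_report})
        messages.append({"role": "assistant", "content": example_answer})
    messages.append({"role": "user", "content": report_text})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```

A few-shot dynamic variant would instead select, for each incoming report, the five most similar examples from the pool (e.g., by embedding similarity) rather than using a fixed set.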

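For the automated text-similarity metrics named in the evaluation, a minimal sketch follows using the `nltk` and `rouge-score` packages. It assumes the comparison is between a manually annotated reference string and a model output; the abstract does not specify the ROUGE variant, so ROUGE-1 F1 is assumed here, and the G-Eval step (LLM-as-judge scoring) is omitted.

```python
# Sketch of BLEU-1 and ROUGE-F1 computation, assuming nltk and rouge-score.
# The reference/candidate strings are invented placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "true critical finding: pneumoperitoneum"           # annotation
candidate = "true critical finding: free intraperitoneal air"  # LLM output

# BLEU-1 restricts BLEU to unigram precision: weights (1, 0, 0, 0).
bleu1 = sentence_bleu(
    [reference.split()],
    candidate.split(),
    weights=(1.0, 0.0, 0.0, 0.0),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 F1: harmonic mean of unigram precision and recall.
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
rouge1_f1 = scorer.score(reference, candidate)["rouge1"].fmeasure

print(f"BLEU-1: {bleu1:.3f}  ROUGE-1 F1: {rouge1_f1:.3f}")
```

The manual metrics (per-category precision and recall) would then come from radiologist review of each extracted finding, which n-gram overlap scores cannot replace.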
Source journal: American Journal of Roentgenology
CiteScore: 12.80 · Self-citation rate: 4.00% · Articles per year: 920 · Review time: 3 months
About the journal: Founded in 1907, the monthly American Journal of Roentgenology (AJR) is the world's longest continuously published general radiology journal. AJR is recognized as among the specialty's leading peer-reviewed journals and has a worldwide circulation of close to 25,000. The journal publishes clinically oriented articles across all radiology subspecialties, seeking relevance to radiologists' daily practice. It publishes hundreds of articles annually in a diverse range of formats, including original research, reviews, clinical perspectives, editorials, and other short reports, and engages its audience through a spectrum of social media and digital communication activities.