Using a Large Language Model for Postdeployment Monitoring of FDA-Approved Artificial Intelligence: Pulmonary Embolism Detection Use Case.

Vera Sorin, Panagiotis Korfiatis, Alex K Bratt, Tim Leiner, Christoph Wald, Crystal Butler, Cole J Cook, Timothy L Kline, Jeremy D Collins
{"title":"Using a Large Language Model for Postdeployment Monitoring of FDA-Approved Artificial Intelligence: Pulmonary Embolism Detection Use Case.","authors":"Vera Sorin, Panagiotis Korfiatis, Alex K Bratt, Tim Leiner, Christoph Wald, Crystal Butler, Cole J Cook, Timothy L Kline, Jeremy D Collins","doi":"10.1016/j.jacr.2025.06.036","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) is increasingly integrated into clinical workflows. The performance of AI in production can diverge from initial evaluations. Postdeployment monitoring (PDM) remains a challenging ingredient of ongoing quality assurance once AI is deployed in clinical production.</p><p><strong>Purpose: </strong>To develop and evaluate a PDM framework that uses large language models (LLMs) for free-text classification of radiology reports, and human oversight. We demonstrate its application to monitor a commercially vended pulmonary embolism (PE) detection AI (CVPED).</p><p><strong>Methods: </strong>We retrospectively analyzed 11,999 CT pulmonary angiography studies performed between April 30, 2023, and June 17, 2024. Ground truth was determined by combining LLM-based radiology report classification and the CVPED outputs, with human review of discrepancies. We simulated a daily monitoring framework to track discrepancies between CVPED and the LLM. Drift was defined when discrepancy rate exceeded a fixed 95% confidence interval for 7 consecutive days. The confidence interval and the optimal retrospective assessment period were determined from a stable dataset with consistent performance. We simulated drift by systematically altering CVPED or LLM sensitivity and specificity, and we modeled an approach to detect data shifts. We incorporated a human-in-the-loop selective alerting framework for continuous prospective evaluation and to investigate potential for incremental detection.</p><p><strong>Results: </strong>Of 11,999 CT pulmonary angiography studies, 1,285 (10.7%) had PE. Overall, 373 (3.1%) had discrepant classifications between CVPED and LLM. Among 111 CVPED-positive and LLM-negative cases, 29 would have triggered an alert due to the radiologist not interacting with CVPED. Of those, 24 were CVPED false-positives, 1 was an LLM false-negative, and the framework ultimately identified 4 true-alerts for incremental PE cases. The optimal retrospective assessment period for drift detection was determined to be 2 months. A 2% to 3% decline in model specificity caused a 2- to 3-fold increase in discrepancies, and a 10% drop in sensitivity was required to produce a similar effect. 
For example, a 2.5% drop in LLM specificity led to a 1.7-fold increase in CVPED-negative LLM-positive discrepancies, which would have taken 22 days to detect using the proposed framework.</p><p><strong>Conclusion: </strong>A PDM framework combining LLM-based free-text classification with a human-in-the-loop alerting system can continuously track an image-based AI's performance, alert for performance drift, and provide incremental clinical value.</p>","PeriodicalId":73968,"journal":{"name":"Journal of the American College of Radiology : JACR","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American College of Radiology : JACR","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1016/j.jacr.2025.06.036","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Background: Artificial intelligence (AI) is increasingly integrated into clinical workflows. The performance of AI in production can diverge from initial evaluations. Postdeployment monitoring (PDM) remains a challenging component of ongoing quality assurance once AI is deployed in clinical production.

Purpose: To develop and evaluate a PDM framework that combines large language model (LLM)-based free-text classification of radiology reports with human oversight. We demonstrate its application to monitoring a commercially vended pulmonary embolism (PE) detection AI (CVPED).

Methods: We retrospectively analyzed 11,999 CT pulmonary angiography studies performed between April 30, 2023, and June 17, 2024. Ground truth was determined by combining LLM-based radiology report classification with the CVPED outputs, with human review of discrepancies. We simulated a daily monitoring framework to track discrepancies between CVPED and the LLM. Drift was defined as the discrepancy rate exceeding the upper bound of a fixed 95% confidence interval for 7 consecutive days. The confidence interval and the optimal retrospective assessment period were determined from a stable dataset with consistent performance. We simulated drift by systematically altering CVPED or LLM sensitivity and specificity, and we modeled an approach to detect data shifts. We incorporated a human-in-the-loop selective alerting framework for continuous prospective evaluation and to investigate the potential for incremental PE detection.
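The drift rule described above amounts to a simple control-chart check on the daily discrepancy rate. The sketch below is a minimal illustration in Python, not the study's actual code: the baseline window length, z-value, and function names are assumptions.

```python
# Minimal sketch of the drift rule described above; the window sizes, z-value,
# and names are illustrative assumptions, not the authors' implementation.
from statistics import mean, stdev

def first_drift_day(daily_rates, baseline_days=60, consecutive_days=7, z=1.96):
    """Return the index of the first day an alert would fire, or None.

    daily_rates: fraction of studies per day where CVPED and the LLM disagree.
    The upper control limit is fixed from a stable ~2-month baseline window.
    """
    baseline = daily_rates[:baseline_days]
    upper = mean(baseline) + z * stdev(baseline)   # fixed 95%-style upper bound
    streak = 0
    for day in range(baseline_days, len(daily_rates)):
        streak = streak + 1 if daily_rates[day] > upper else 0
        if streak >= consecutive_days:             # 7 consecutive days above bound
            return day
    return None
```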

Results: Of 11,999 CT pulmonary angiography studies, 1,285 (10.7%) had PE. Overall, 373 (3.1%) had discrepant classifications between CVPED and the LLM. Among 111 CVPED-positive, LLM-negative cases, 29 would have triggered an alert because the radiologist had not interacted with CVPED. Of those, 24 were CVPED false positives, 1 was an LLM false negative, and the framework ultimately identified 4 true alerts for incremental PE cases. The optimal retrospective assessment period for drift detection was 2 months. A 2% to 3% decline in model specificity caused a 2- to 3-fold increase in discrepancies, whereas a 10% drop in sensitivity was required to produce a similar effect. For example, a 2.5% drop in LLM specificity led to a 1.7-fold increase in CVPED-negative, LLM-positive discrepancies, which would have taken 22 days to detect using the proposed framework.
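To make the relationship between operating-point shifts and discrepancy counts concrete, the sketch below computes the expected disagreement rate between two binary classifiers under a conditional-independence assumption and shows how a small specificity drop inflates it at the cohort's PE prevalence. The sensitivity and specificity values are assumed for illustration only; they are not the measured performance of CVPED or the LLM.

```python
# Illustrative sketch: expected CVPED-LLM disagreement rate as a function of
# each model's operating point, assuming errors are conditionally independent
# given PE status. Operating points below are assumptions, not study values.
def discrepancy_rate(prev, sens_a, spec_a, sens_b, spec_b):
    disagree_pos = prev * (sens_a * (1 - sens_b) + (1 - sens_a) * sens_b)
    disagree_neg = (1 - prev) * (spec_a * (1 - spec_b) + (1 - spec_a) * spec_b)
    return disagree_pos + disagree_neg

prev = 0.107  # PE prevalence reported in the study cohort
baseline = discrepancy_rate(prev, 0.95, 0.98, 0.95, 0.98)
degraded = discrepancy_rate(prev, 0.95, 0.98, 0.95, 0.98 - 0.025)  # 2.5% specificity drop
print(f"disagreement rate: {baseline:.3f} -> {degraded:.3f} ({degraded / baseline:.1f}x)")
```

Because most studies are PE-negative, even a small specificity decline pushes many studies into disagreement, which is consistent with the reported finding that specificity drift is detectable from a much smaller change than sensitivity drift.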

Conclusion: A PDM framework combining LLM-based free-text classification with a human-in-the-loop alerting system can continuously track an image-based AI's performance, alert for performance drift, and provide incremental clinical value.
