Vera Sorin, Panagiotis Korfiatis, Alex K Bratt, Tim Leiner, Christoph Wald, Crystal Butler, Cole J Cook, Timothy L Kline, Jeremy D Collins
{"title":"使用大型语言模型进行FDA批准的人工智能部署后监测:肺栓塞检测用例。","authors":"Vera Sorin, Panagiotis Korfiatis, Alex K Bratt, Tim Leiner, Christoph Wald, Crystal Butler, Cole J Cook, Timothy L Kline, Jeremy D Collins","doi":"10.1016/j.jacr.2025.06.036","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) is increasingly integrated into clinical workflows. The performance of AI in production can diverge from initial evaluations. Postdeployment monitoring (PDM) remains a challenging ingredient of ongoing quality assurance once AI is deployed in clinical production.</p><p><strong>Purpose: </strong>To develop and evaluate a PDM framework that uses large language models (LLMs) for free-text classification of radiology reports, and human oversight. We demonstrate its application to monitor a commercially vended pulmonary embolism (PE) detection AI (CVPED).</p><p><strong>Methods: </strong>We retrospectively analyzed 11,999 CT pulmonary angiography studies performed between April 30, 2023, and June 17, 2024. Ground truth was determined by combining LLM-based radiology report classification and the CVPED outputs, with human review of discrepancies. We simulated a daily monitoring framework to track discrepancies between CVPED and the LLM. Drift was defined when discrepancy rate exceeded a fixed 95% confidence interval for 7 consecutive days. The confidence interval and the optimal retrospective assessment period were determined from a stable dataset with consistent performance. We simulated drift by systematically altering CVPED or LLM sensitivity and specificity, and we modeled an approach to detect data shifts. We incorporated a human-in-the-loop selective alerting framework for continuous prospective evaluation and to investigate potential for incremental detection.</p><p><strong>Results: </strong>Of 11,999 CT pulmonary angiography studies, 1,285 (10.7%) had PE. Overall, 373 (3.1%) had discrepant classifications between CVPED and LLM. Among 111 CVPED-positive and LLM-negative cases, 29 would have triggered an alert due to the radiologist not interacting with CVPED. Of those, 24 were CVPED false-positives, 1 was an LLM false-negative, and the framework ultimately identified 4 true-alerts for incremental PE cases. The optimal retrospective assessment period for drift detection was determined to be 2 months. A 2% to 3% decline in model specificity caused a 2- to 3-fold increase in discrepancies, and a 10% drop in sensitivity was required to produce a similar effect. 
For example, a 2.5% drop in LLM specificity led to a 1.7-fold increase in CVPED-negative LLM-positive discrepancies, which would have taken 22 days to detect using the proposed framework.</p><p><strong>Conclusion: </strong>A PDM framework combining LLM-based free-text classification with a human-in-the-loop alerting system can continuously track an image-based AI's performance, alert for performance drift, and provide incremental clinical value.</p>","PeriodicalId":73968,"journal":{"name":"Journal of the American College of Radiology : JACR","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Using a Large Language Model for Postdeployment Monitoring of FDA-Approved Artificial Intelligence: Pulmonary Embolism Detection Use Case.\",\"authors\":\"Vera Sorin, Panagiotis Korfiatis, Alex K Bratt, Tim Leiner, Christoph Wald, Crystal Butler, Cole J Cook, Timothy L Kline, Jeremy D Collins\",\"doi\":\"10.1016/j.jacr.2025.06.036\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Artificial intelligence (AI) is increasingly integrated into clinical workflows. The performance of AI in production can diverge from initial evaluations. Postdeployment monitoring (PDM) remains a challenging ingredient of ongoing quality assurance once AI is deployed in clinical production.</p><p><strong>Purpose: </strong>To develop and evaluate a PDM framework that uses large language models (LLMs) for free-text classification of radiology reports, and human oversight. We demonstrate its application to monitor a commercially vended pulmonary embolism (PE) detection AI (CVPED).</p><p><strong>Methods: </strong>We retrospectively analyzed 11,999 CT pulmonary angiography studies performed between April 30, 2023, and June 17, 2024. Ground truth was determined by combining LLM-based radiology report classification and the CVPED outputs, with human review of discrepancies. We simulated a daily monitoring framework to track discrepancies between CVPED and the LLM. Drift was defined when discrepancy rate exceeded a fixed 95% confidence interval for 7 consecutive days. The confidence interval and the optimal retrospective assessment period were determined from a stable dataset with consistent performance. We simulated drift by systematically altering CVPED or LLM sensitivity and specificity, and we modeled an approach to detect data shifts. We incorporated a human-in-the-loop selective alerting framework for continuous prospective evaluation and to investigate potential for incremental detection.</p><p><strong>Results: </strong>Of 11,999 CT pulmonary angiography studies, 1,285 (10.7%) had PE. Overall, 373 (3.1%) had discrepant classifications between CVPED and LLM. Among 111 CVPED-positive and LLM-negative cases, 29 would have triggered an alert due to the radiologist not interacting with CVPED. Of those, 24 were CVPED false-positives, 1 was an LLM false-negative, and the framework ultimately identified 4 true-alerts for incremental PE cases. The optimal retrospective assessment period for drift detection was determined to be 2 months. A 2% to 3% decline in model specificity caused a 2- to 3-fold increase in discrepancies, and a 10% drop in sensitivity was required to produce a similar effect. 
For example, a 2.5% drop in LLM specificity led to a 1.7-fold increase in CVPED-negative LLM-positive discrepancies, which would have taken 22 days to detect using the proposed framework.</p><p><strong>Conclusion: </strong>A PDM framework combining LLM-based free-text classification with a human-in-the-loop alerting system can continuously track an image-based AI's performance, alert for performance drift, and provide incremental clinical value.</p>\",\"PeriodicalId\":73968,\"journal\":{\"name\":\"Journal of the American College of Radiology : JACR\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-06-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of the American College of Radiology : JACR\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1016/j.jacr.2025.06.036\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American College of Radiology : JACR","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1016/j.jacr.2025.06.036","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Using a Large Language Model for Postdeployment Monitoring of FDA-Approved Artificial Intelligence: Pulmonary Embolism Detection Use Case.
Background: Artificial intelligence (AI) is increasingly integrated into clinical workflows. The performance of AI in production can diverge from initial evaluations. Postdeployment monitoring (PDM) remains a challenging component of ongoing quality assurance once AI is deployed in clinical production.
Purpose: To develop and evaluate a PDM framework that combines large language model (LLM)-based free-text classification of radiology reports with human oversight. We demonstrate its application to monitoring a commercially vended pulmonary embolism (PE) detection AI (CVPED).
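As a rough illustration of how an LLM can be used for free-text report classification, the sketch below wraps a binary PE-present/PE-absent prompt around a generic call_llm helper. The prompt wording, the call_llm function, and the label-parsing logic are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of LLM-based free-text classification of a radiology report.
# The prompt wording and the call_llm() helper are hypothetical placeholders;
# any chat-completion client available at a given site could be substituted.

PROMPT_TEMPLATE = (
    "You are reviewing a CT pulmonary angiography report.\n"
    "Answer with exactly one word, POSITIVE or NEGATIVE, indicating whether\n"
    "the report describes an acute pulmonary embolism.\n\n"
    "Report:\n{report}"
)

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion API call; returns the model's raw text."""
    raise NotImplementedError("Wire this to the LLM endpoint available at your site.")

def classify_report(report_text: str) -> bool:
    """Return True if the LLM labels the report as positive for PE."""
    raw = call_llm(PROMPT_TEMPLATE.format(report=report_text)).strip().upper()
    if raw.startswith("POSITIVE"):
        return True
    if raw.startswith("NEGATIVE"):
        return False
    raise ValueError(f"Unparseable LLM output: {raw!r}")
```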
Methods: We retrospectively analyzed 11,999 CT pulmonary angiography studies performed between April 30, 2023, and June 17, 2024. Ground truth was determined by combining LLM-based radiology report classification and the CVPED outputs, with human review of discrepancies. We simulated a daily monitoring framework to track discrepancies between CVPED and the LLM. Drift was defined as the discrepancy rate exceeding a fixed 95% confidence interval for 7 consecutive days. The confidence interval and the optimal retrospective assessment period were determined from a stable dataset with consistent performance. We simulated drift by systematically altering CVPED or LLM sensitivity and specificity, and we modeled an approach to detect data shifts. We incorporated a human-in-the-loop selective alerting framework for continuous prospective evaluation and to investigate the potential for incremental detection.
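A minimal sketch of the drift rule described above, assuming daily discrepancy rates and a normal-approximation 95% band estimated from a stable baseline window; the baseline length, the choice of statistic, and the function name are illustrative assumptions, not the paper's exact implementation.

```python
import statistics
from typing import Optional, Sequence

def drift_day(daily_rates: Sequence[float], baseline_days: int = 60,
              consecutive: int = 7, z: float = 1.96) -> Optional[int]:
    """Return the index of the first day on which drift is declared, else None.

    A fixed 95% band is estimated from the first `baseline_days` of daily
    CVPED-vs-LLM discrepancy rates (assumed stable); drift is declared once the
    daily rate stays above the upper bound for `consecutive` days in a row.
    """
    baseline = daily_rates[:baseline_days]
    upper = statistics.mean(baseline) + z * statistics.stdev(baseline)
    run = 0
    for day, rate in enumerate(daily_rates[baseline_days:], start=baseline_days):
        run = run + 1 if rate > upper else 0
        if run >= consecutive:
            return day
    return None
```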
Results: Of 11,999 CT pulmonary angiography studies, 1,285 (10.7%) had PE. Overall, 373 (3.1%) had discrepant classifications between CVPED and the LLM. Among 111 CVPED-positive, LLM-negative cases, 29 would have triggered an alert because the radiologist did not interact with the CVPED output. Of those, 24 were CVPED false positives, 1 was an LLM false negative, and the framework ultimately identified 4 true alerts for incremental PE cases. The optimal retrospective assessment period for drift detection was determined to be 2 months. A 2% to 3% decline in model specificity caused a 2- to 3-fold increase in discrepancies, whereas a 10% drop in sensitivity was required to produce a similar effect. For example, a 2.5% drop in LLM specificity led to a 1.7-fold increase in CVPED-negative, LLM-positive discrepancies, which would have taken 22 days to detect using the proposed framework.
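To see why a small specificity drop inflates discrepancies more than a much larger sensitivity drop at roughly 10% PE prevalence, the sketch below computes the expected CVPED-vs-LLM discrepancy rate assuming the two classifiers err independently given the true label; the operating points are assumed for illustration and are not the paper's fitted values.

```python
def expected_discrepancy_rate(prev: float,
                              se_a: float, sp_a: float,
                              se_b: float, sp_b: float) -> float:
    """Expected fraction of studies where classifiers A and B disagree,
    assuming their errors are independent given the true label."""
    pos_disagree = se_a * (1 - se_b) + (1 - se_a) * se_b           # among true positives
    neg_disagree = (1 - sp_a) * sp_b + sp_a * (1 - sp_b)           # among true negatives
    return prev * pos_disagree + (1 - prev) * neg_disagree

# Illustrative operating points (assumed, not from the paper):
prev = 0.107                                                        # PE prevalence in the cohort
base      = expected_discrepancy_rate(prev, 0.95, 0.99, 0.95, 0.99)
spec_drop = expected_discrepancy_rate(prev, 0.95, 0.99, 0.95, 0.965)  # 2.5% specificity drop
sens_drop = expected_discrepancy_rate(prev, 0.95, 0.99, 0.85, 0.99)   # 10% sensitivity drop
print(f"baseline {base:.3f}, after specificity drop {spec_drop:.3f}, "
      f"after sensitivity drop {sens_drop:.3f}")
```

Under these assumed operating points, the small specificity drop roughly doubles the expected discrepancy rate while the much larger sensitivity drop has a smaller relative effect, because negatives dominate the cohort at low PE prevalence.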
Conclusion: A PDM framework combining LLM-based free-text classification with a human-in-the-loop alerting system can continuously track an image-based AI's performance, alert for performance drift, and provide incremental clinical value.