False hope of a single generalisable AI sepsis prediction model: bias and proposed mitigation strategies for improving performance based on a retrospective multisite cohort study.

IF 6.5 | Zone 1 (Medicine) | Q1 HEALTH CARE SCIENCES & SERVICES

BMJ Quality & Safety | Pub Date: 2025-08-18 | DOI: 10.1136/bmjqs-2024-018328

Rudolf Schnetler, Anton van der Vegt, Vikrant R Kalke, Paul Lane, Ian Scott

Pages: 580-589. Publication type: Journal Article.

Citations: 0

Abstract

Objective: To identify bias in using a single machine learning (ML) sepsis prediction model across multiple hospitals and care locations; evaluate the impact of six different bias mitigation strategies and propose a generic modelling approach for developing best-performing models.

Methods: We developed a baseline ML model to predict sepsis using retrospective data on patients in emergency departments (EDs) and wards across nine hospitals. We set model sensitivity at 70% and determined the number of alerts required to be evaluated (number needed to evaluate (NNE), 95% CI) for each case of true sepsis and the number of hours between the first alert and timestamped outcomes meeting sepsis-3 reference criteria (HTS3). Six bias mitigation models were compared with the baseline model for impact on NNE and HTS3.
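The NNE calculation described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes one risk score per admission and a binary sepsis label, and the function names `threshold_for_sensitivity` and `number_needed_to_evaluate` are hypothetical. Under those assumptions, NNE is the number of alerts raised per true sepsis case detected, which equals 1/PPV at the chosen threshold.

```python
import numpy as np


def threshold_for_sensitivity(scores, labels, target_sensitivity=0.70):
    """Return the highest score threshold whose sensitivity is >= the target.

    Assumption: one score per admission; an alert fires when score >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos_scores = np.sort(scores[labels])[::-1]  # sepsis-case scores, descending
    if pos_scores.size == 0:
        raise ValueError("need at least one sepsis case to fix sensitivity")
    # Smallest number of sepsis cases that must be flagged to reach the target.
    k = max(int(np.ceil(target_sensitivity * pos_scores.size)), 1)
    return pos_scores[k - 1]


def number_needed_to_evaluate(scores, labels, threshold):
    """Alerts raised per true sepsis case detected (equivalently 1 / PPV)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    alerts = scores >= threshold
    true_positives = int(np.sum(alerts & labels))
    if true_positives == 0:
        return float("inf")
    return int(alerts.sum()) / true_positives


# Toy data (invented for illustration only, not study data).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
toy_threshold = threshold_for_sensitivity(toy_scores, toy_labels, 0.70)
toy_nne = number_needed_to_evaluate(toy_scores, toy_labels, toy_threshold)
```

Fixing sensitivity first and then reading off NNE makes models comparable on alert burden: two models that catch the same share of sepsis cases can still differ in how many alerts clinicians must review per case found.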

Results: Across 969 292 admissions, mean NNE for the baseline model was significantly lower for EDs (6.1 patients, 95% CI 6 to 6.2) than for wards (7.5 patients, 95% CI 7.4 to 7.5). Across all sites, median HTS3 was 20 hours (20-21) for wards vs 5 (5-5) for EDs. Bias mitigation models significantly impacted NNE but not HTS3. Compared with the baseline model, the best-performing models for NNE with reduced interhospital variance were those trained separately on data from ED patients or from ward patients across all sites. These models generated the lowest NNE results for all care locations in seven of nine hospitals.
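The HTS3 comparison between care locations can be illustrated with a short sketch. It assumes each record holds a care location, the timestamp of the first alert, and the timestamp at which sepsis-3 criteria were met; the record layout and the function names are hypothetical, and no filtering of alerts that fire after the outcome is applied here.

```python
from datetime import datetime
from statistics import median


def hours_to_sepsis3(first_alert, sepsis3_time):
    """Hours between the first alert and the timestamped sepsis-3 outcome."""
    return (sepsis3_time - first_alert).total_seconds() / 3600.0


def median_hts3_by_location(records):
    """Group (location, first_alert, sepsis3_time) records and take the median HTS3."""
    by_location = {}
    for location, first_alert, sepsis3_time in records:
        by_location.setdefault(location, []).append(
            hours_to_sepsis3(first_alert, sepsis3_time)
        )
    return {loc: median(hours) for loc, hours in by_location.items()}


# Toy records (invented for illustration only, not study data).
toy_records = [
    ("ED", datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 12, 0)),
    ("ED", datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 14, 0)),
    ("ED", datetime(2024, 1, 2, 7, 0), datetime(2024, 1, 2, 13, 0)),
    ("ward", datetime(2024, 1, 3, 6, 0), datetime(2024, 1, 4, 0, 0)),
    ("ward", datetime(2024, 1, 5, 2, 0), datetime(2024, 1, 6, 0, 0)),
]
toy_medians = median_hts3_by_location(toy_records)
```

A longer median HTS3 means alerts fire earlier relative to the reference outcome, so the metric describes the length of the prediction window rather than alert accuracy.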

Conclusions: Implementing a single sepsis prediction model across all sites and care locations within multihospital systems may be unacceptable given large variances in NNE across multiple sites. Bias mitigation methods can identify models demonstrating improved performance across most sites in reducing alert burden but with no impact on the length of the prediction window.


Source journal

BMJ Quality & Safety (HEALTH CARE SCIENCES & SERVICES)

CiteScore: 9.80

Self-citation rate: 7.40%

Articles published: 104

Review time: 4-8 weeks

About the journal: BMJ Quality & Safety (previously Quality & Safety in Health Care) is an international peer-reviewed publication providing research, opinions, debates and reviews for academics, clinicians and healthcare managers focused on the quality and safety of health care and the science of improvement. The journal receives approximately 1000 manuscripts a year and has an acceptance rate for original research of 12%. Time from submission to first decision averages 22 days and accepted articles are typically published online within 20 days. Its current impact factor is 3.281.