External Validation Complexities: A Comparative Study of Late-onset Sepsis Prediction Models Across Multiple Clinical Environments
Zheng Peng, Janno S Schouten, Demi Silvertand, Xi Long, Douglas E Lake, H Rob Taal, Hendrik J Niemarkt, Peter Andriessen, Brynne Sullivan, Carola van Pul
IEEE Transactions on Biomedical Engineering, published 2025-10-06. DOI: 10.1109/TBME.2025.3618080
Abstract
Objective: Neonatal late-onset sepsis (LOS) is a life-threatening condition in preterm infants in neonatal intensive care units (NICUs), with early detection being crucial for improving outcomes. Despite advancements in data-driven prediction models, their generalizability remains uncertain due to a lack of independent validation, particularly on national and international scales. This study evaluates the performance of two LOS prediction models on multiple validation datasets to assess their reliability for clinical implementation.
Methods: Two models were validated: (1) a multi-channel feature-based extreme gradient boosting model (MC-XGB) and (2) a deep neural network using raw RR intervals (RR-DNN). Validation was conducted on three NICU datasets: an internal dataset (68 LOS, 100 controls) from the model-development hospital in the Netherlands, a national external dataset (20 LOS, 20 controls) from another Dutch hospital, and an international external dataset (17 LOS, 17 controls) from a U.S. hospital. Model performance was assessed using the area under the receiver operating characteristic curve (AUC) across multiple prediction time windows, with an hourly risk analysis.
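To make the evaluation setup concrete, the sketch below shows one way hourly risk scores and an AUC could be computed for a feature-based gradient-boosting classifier of the MC-XGB type. It is an illustrative sketch only, not the authors' implementation: the synthetic data, feature count, labeling window, and hyperparameters are assumptions introduced here for demonstration.

```python
# Minimal sketch (not the authors' code): hourly LOS risk scoring with a
# gradient-boosting classifier and held-out AUC. Synthetic stand-in data;
# feature count, labeling window, and hyperparameters are assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)

# One feature vector per patient-hour (e.g., heart-rate-variability
# summaries), labeled 1 within a pre-onset window before culture-proven
# sepsis and 0 otherwise.
n_hours, n_features = 5000, 12
X = rng.normal(size=(n_hours, n_features))
y = rng.binomial(1, 0.1, size=n_hours)
X[y == 1] += 0.5  # give positives a weak signal so AUC exceeds chance

# Split hours into a development set and a held-out "external" set.
split = int(0.7 * n_hours)
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

model = XGBClassifier(n_estimators=200, max_depth=3,
                      learning_rate=0.05, eval_metric="auc")
model.fit(X_train, y_train)

# Hourly risk scores on the held-out set, summarized as an AUC.
risk = model.predict_proba(X_test)[:, 1]
print(f"held-out AUC: {roc_auc_score(y_test, risk):.2f}")
```

In the study itself this kind of AUC would be computed separately for each prediction time window and for each of the three datasets, which is where the internal-versus-external comparison in the Results arises.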
Results: Both models achieved a peak AUC of 0.82 on the internal dataset, but their predictive performance declined to varying degrees on the external datasets. The AUCs for RR-DNN and MC-XGB were 0.80 and 0.72, respectively, in the national dataset, and 0.69 and 0.60 in the international dataset. These declines may result from variations in clinical practices, patient demographics, and monitoring technologies.
Conclusion: Model performance declined in external validations, highlighting the challenges of implementing predictive models across diverse clinical settings.
Significance: This study emphasizes the need for standardized guidelines and improved data sharing to enhance model development and facilitate reliable integration into NICU workflow for improved LOS management.
Journal Introduction:
IEEE Transactions on Biomedical Engineering contains basic and applied papers dealing with biomedical engineering. Papers range from engineering development in methods and techniques with biomedical applications to experimental and clinical investigations with engineering contributions.