Can we develop real-world prognostic models using observational healthcare data? Large-scale experiment to investigate model sensitivity to database and phenotypes.

Jenna M Reps, Peter R Rijnbeek, Patrick B Ryan
{"title":"Can we develop real-world prognostic models using observational healthcare data? Large-scale experiment to investigate model sensitivity to database and phenotypes.","authors":"Jenna M Reps, Peter R Rijnbeek, Patrick B Ryan","doi":"10.1186/s41512-025-00191-x","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large observational healthcare databases are frequently used to develop models to be implemented in real-world clinical practice populations. For example, these databases were used to develop COVID severity models that guided interventions such as who to prioritize vaccinating during the pandemic. However, the clinical setting and observational databases often differ in the types of patients (case mix), and it is a nontrivial process to identify patients with medical conditions (phenotyping) in these databases. In this study, we investigate how sensitive a model's performance is to the choice of development database, population, and outcome phenotype.</p><p><strong>Methods: </strong>We developed > 450 different logistic regression models for nine prediction tasks across seven databases with a range of suitable population and outcome phenotypes. Performance stability within tasks was calculated by applying each model to data created by permuting the database, population, or outcome phenotype. We investigate performance (AUROC, scaled Brier, and calibration-in-the-large) stability and individual risk estimate stability.</p><p><strong>Results: </strong>In general, changing the outcome definitions or population phenotype made little impact on the model validation discrimination. However, validation discrimination was unstable when the database changed. Calibration and Brier performance were unstable when the population, outcome definition, or database changed. 
This may be problematic if a model developed using observational data is implemented in a real-world setting.</p><p><strong>Conclusions: </strong>These results highlight the importance of validating a model developed using observational data in the clinical setting prior to using it for decision-making. Calibration and Brier score should be evaluated to prevent miscalibrated risk estimates being used to aid clinical decisions.</p>","PeriodicalId":72800,"journal":{"name":"Diagnostic and prognostic research","volume":"9 1","pages":"10"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12004590/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Diagnostic and prognostic research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s41512-025-00191-x","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Background: Large observational healthcare databases are frequently used to develop models to be implemented in real-world clinical practice populations. For example, these databases were used to develop COVID severity models that guided interventions such as who to prioritize vaccinating during the pandemic. However, the clinical setting and observational databases often differ in the types of patients (case mix), and it is a nontrivial process to identify patients with medical conditions (phenotyping) in these databases. In this study, we investigate how sensitive a model's performance is to the choice of development database, population, and outcome phenotype.

Methods: We developed > 450 different logistic regression models for nine prediction tasks across seven databases with a range of suitable population and outcome phenotypes. Performance stability within tasks was calculated by applying each model to data created by permuting the database, population, or outcome phenotype. We investigated performance stability (AUROC, scaled Brier, and calibration-in-the-large) and individual risk estimate stability.
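The three performance measures named above are standard and can be computed directly from observed outcomes and predicted risks. The sketch below is illustrative only (it is not code from the study, which used the OHDSI patient-level prediction framework); it assumes binary outcomes, distinct predicted probabilities, and reports calibration-in-the-large on the probability scale (some authors report it on the log-odds scale instead).

```python
import numpy as np

def discrimination_and_calibration(y_true, y_prob):
    """Compute AUROC, scaled Brier score, and calibration-in-the-large
    from binary outcomes y_true and predicted risks y_prob."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)

    # AUROC via the rank-sum (Mann-Whitney) formulation:
    # probability that a random case is ranked above a random non-case.
    order = np.argsort(y_prob)
    ranks = np.empty(len(y_prob))
    ranks[order] = np.arange(1, len(y_prob) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    auroc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

    # Scaled Brier: 1 - Brier / Brier_max, where Brier_max is the Brier
    # score of a model that predicts the mean observed risk for everyone.
    brier = np.mean((y_prob - y_true) ** 2)
    p_mean = y_true.mean()
    scaled_brier = 1 - brier / (p_mean * (1 - p_mean))

    # Calibration-in-the-large: mean observed risk minus mean predicted risk
    # (0 means predictions are correct on average; here on the probability scale).
    citl = y_true.mean() - y_prob.mean()

    return auroc, scaled_brier, citl
```

Permuting the database, population, or outcome phenotype and recomputing these three numbers for a fixed model is, in essence, the stability experiment the Methods describe.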

Results: In general, changing the outcome definition or population phenotype had little impact on the model's discrimination at validation. However, validation discrimination was unstable when the database changed. Calibration and Brier performance were unstable when the population, outcome definition, or database changed. This may be problematic if a model developed using observational data is implemented in a real-world setting.

Conclusions: These results highlight the importance of validating a model developed using observational data in the clinical setting prior to using it for decision-making. Calibration and Brier score should be evaluated to prevent miscalibrated risk estimates being used to aid clinical decisions.
