{"title":"Machine Learning Versus Logistic Regression for Propensity Score Estimation: A Benchmark Trial Emulation Against the PARADIGM-HF Randomized Trial.","authors":"Kaicheng Wang, Lindsey A Rosman, Haidong Lu","doi":"10.1101/2025.06.16.25329708","DOIUrl":null,"url":null,"abstract":"<p><p>Machine learning (ML) algorithms are increasingly used to estimate propensity score with expectation of improving causal inference. However, the validity of ML-based approaches for confounder selection and adjustment remains unclear. In this study, we emulated the device-stratified secondary analysis of the PARADIGM-HF trial among U.S. veterans with heart failure and implanted cardiac devices from 2016 to 2020. We benchmarked observational estimates from three propensity score approaches against the trial results: (1) logistic regression with pre-specified confounders, (2) generalized boosted models (GBM) using the same pre-specified confounders, and (3) GBM with expanded covariates and automated feature selection. Logistic regression-based propensity score approach yielded estimates closest to the trial (HR = 0.93, 95% CI 0.61-1.42; 23-month RR = 0.86, 95% CI 0.57-1.24 vs. trial HR = 0.81, 95% CI 0.61-1.06). Despite better predictive performance, GBM with pre-specified confounders showed no improvement over the logistic regression approach (HR = 0.97, 95% CI 0.68-1.37; RR = 0.96, 95% CI 0.89-1.98). Notably, GBM with expanded covariates and data-driven automated feature selection substantially increased bias (HR = 0.61, 95% CI 0.30-1.23; RR = 0.69, 95% CI 0.36-1.04). Our findings suggest that ML-based propensity score methods do not inherently improve causal estimation-possibly due to residual confounding from omitted or partially adjusted variables-and may introduce overadjustment bias when combined with automated feature selection, underscoring the importance of careful confounder specification and causal reasoning over algorithmic complexity in causal inference.</p>","PeriodicalId":94281,"journal":{"name":"medRxiv : the preprint server for health sciences","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12204248/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv : the preprint server for health sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2025.06.16.25329708","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Machine learning (ML) algorithms are increasingly used to estimate propensity scores, with the expectation of improving causal inference. However, the validity of ML-based approaches for confounder selection and adjustment remains unclear. In this study, we emulated the device-stratified secondary analysis of the PARADIGM-HF trial among U.S. veterans with heart failure and implanted cardiac devices from 2016 to 2020. We benchmarked observational estimates from three propensity score approaches against the trial results: (1) logistic regression with pre-specified confounders, (2) generalized boosted models (GBM) using the same pre-specified confounders, and (3) GBM with expanded covariates and automated feature selection. The logistic regression-based propensity score approach yielded estimates closest to the trial (HR = 0.93, 95% CI 0.61-1.42; 23-month RR = 0.86, 95% CI 0.57-1.24 vs. trial HR = 0.81, 95% CI 0.61-1.06). Despite better predictive performance, GBM with pre-specified confounders showed no improvement over the logistic regression approach (HR = 0.97, 95% CI 0.68-1.37; RR = 0.96, 95% CI 0.89-1.98). Notably, GBM with expanded covariates and data-driven automated feature selection substantially increased bias (HR = 0.61, 95% CI 0.30-1.23; RR = 0.69, 95% CI 0.36-1.04). Our findings suggest that ML-based propensity score methods do not inherently improve causal estimation (possibly due to residual confounding from omitted or partially adjusted variables) and may introduce overadjustment bias when combined with automated feature selection. These results underscore the importance of careful confounder specification and causal reasoning over algorithmic complexity in causal inference.
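To make the comparison concrete, below is a minimal, hypothetical sketch (not the authors' code) of the first two propensity score approaches: fitting a logistic regression and a boosted-tree classifier on the same pre-specified confounders, then forming inverse probability of treatment weights (IPTW). The simulated data, scikit-learn classes, and model settings are illustrative assumptions; the paper's generalized boosted models are commonly fit with other tooling, such as R's twang package.

```python
# Illustrative sketch only: logistic regression vs. gradient boosting for
# propensity score estimation, followed by IPTW weights for the ATE.
# All data and settings here are hypothetical, not from the study.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))                               # pre-specified confounders (toy)
treatment = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # toy treatment assignment

# (1) Logistic regression with pre-specified confounders
ps_lr = LogisticRegression(max_iter=1000).fit(X, treatment).predict_proba(X)[:, 1]

# (2) Gradient boosting on the same confounders (a stand-in for GBM)
ps_gbm = GradientBoostingClassifier(random_state=0).fit(X, treatment).predict_proba(X)[:, 1]

def iptw(ps, a):
    """Inverse probability of treatment weights targeting the ATE."""
    return np.where(a == 1, 1.0 / ps, 1.0 / (1.0 - ps))

w_lr = iptw(ps_lr, treatment)
w_gbm = iptw(ps_gbm, treatment)
```

In practice, either set of weights would feed a weighted outcome model (e.g., a weighted Cox model for the HR), and covariate balance would be checked under each weighting scheme before estimating effects.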