Machine Learning Versus Logistic Regression for Propensity Score Estimation: A Benchmark Trial Emulation Against the PARADIGM-HF Randomized Trial.

Kaicheng Wang, Lindsey A Rosman, Haidong Lu
{"title":"Machine Learning Versus Logistic Regression for Propensity Score Estimation: A Benchmark Trial Emulation Against the PARADIGM-HF Randomized Trial.","authors":"Kaicheng Wang, Lindsey A Rosman, Haidong Lu","doi":"10.1101/2025.06.16.25329708","DOIUrl":null,"url":null,"abstract":"<p><p>Machine learning (ML) algorithms are increasingly used to estimate propensity score with expectation of improving causal inference. However, the validity of ML-based approaches for confounder selection and adjustment remains unclear. In this study, we emulated the device-stratified secondary analysis of the PARADIGM-HF trial among U.S. veterans with heart failure and implanted cardiac devices from 2016 to 2020. We benchmarked observational estimates from three propensity score approaches against the trial results: (1) logistic regression with pre-specified confounders, (2) generalized boosted models (GBM) using the same pre-specified confounders, and (3) GBM with expanded covariates and automated feature selection. Logistic regression-based propensity score approach yielded estimates closest to the trial (HR = 0.93, 95% CI 0.61-1.42; 23-month RR = 0.86, 95% CI 0.57-1.24 vs. trial HR = 0.81, 95% CI 0.61-1.06). Despite better predictive performance, GBM with pre-specified confounders showed no improvement over the logistic regression approach (HR = 0.97, 95% CI 0.68-1.37; RR = 0.96, 95% CI 0.89-1.98). Notably, GBM with expanded covariates and data-driven automated feature selection substantially increased bias (HR = 0.61, 95% CI 0.30-1.23; RR = 0.69, 95% CI 0.36-1.04). Our findings suggest that ML-based propensity score methods do not inherently improve causal estimation-possibly due to residual confounding from omitted or partially adjusted variables-and may introduce overadjustment bias when combined with automated feature selection, underscoring the importance of careful confounder specification and causal reasoning over algorithmic complexity in causal inference.</p>","PeriodicalId":94281,"journal":{"name":"medRxiv : the preprint server for health sciences","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12204248/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv : the preprint server for health sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2025.06.16.25329708","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Machine learning (ML) algorithms are increasingly used to estimate propensity score with expectation of improving causal inference. However, the validity of ML-based approaches for confounder selection and adjustment remains unclear. In this study, we emulated the device-stratified secondary analysis of the PARADIGM-HF trial among U.S. veterans with heart failure and implanted cardiac devices from 2016 to 2020. We benchmarked observational estimates from three propensity score approaches against the trial results: (1) logistic regression with pre-specified confounders, (2) generalized boosted models (GBM) using the same pre-specified confounders, and (3) GBM with expanded covariates and automated feature selection. Logistic regression-based propensity score approach yielded estimates closest to the trial (HR = 0.93, 95% CI 0.61-1.42; 23-month RR = 0.86, 95% CI 0.57-1.24 vs. trial HR = 0.81, 95% CI 0.61-1.06). Despite better predictive performance, GBM with pre-specified confounders showed no improvement over the logistic regression approach (HR = 0.97, 95% CI 0.68-1.37; RR = 0.96, 95% CI 0.89-1.98). Notably, GBM with expanded covariates and data-driven automated feature selection substantially increased bias (HR = 0.61, 95% CI 0.30-1.23; RR = 0.69, 95% CI 0.36-1.04). Our findings suggest that ML-based propensity score methods do not inherently improve causal estimation-possibly due to residual confounding from omitted or partially adjusted variables-and may introduce overadjustment bias when combined with automated feature selection, underscoring the importance of careful confounder specification and causal reasoning over algorithmic complexity in causal inference.

Abstract Image

Abstract Image

Abstract Image

基于机器学习的倾向评分评估:针对随机试验的基准观察分析。
机器学习(ML)倾向评分估计方法越来越多地用于改善协变量平衡和减少偏差的期望,但它们在选择适当混杂因素方面的有效性仍然存在争议。在这项研究中,我们评估了2016年至2020年美国退伍军人事务部植入式心律转复除颤器心力衰竭患者中,苏比利/缬沙坦与血管紧张素转换酶抑制剂和血管紧张素受体阻滞剂对全因死亡率的影响。我们比较了传统逻辑回归和基于ml的倾向评分方法的结果,并将其与PARADIGM-HF随机试验进行了比较。采用先验混杂因素选择的logistic回归估计(HR = 0.93, 95% CI 0.61 - 1.42;27个月RR = 0.87, 95% CI 0.59 - 1.21)与试验结果最接近(HR = 0.81;95% ci 0.61 - 1.06)。相比之下,广义增强模型的表现并不优于传统的逻辑回归,当与数据驱动的混杂因素选择相结合时,可能会放大偏差(HR = 0.63, 95% CI 0.31 - 1.30;Rr = 0.61, 95% ci 0.33 - 1.04)。我们的研究结果表明,基于ml的倾向评分可能会引入过度调整偏差,并强调了主题知识在高维现实世界数据因果推理中的重要性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信