Causal Machine Learning Methods and Use of Cross-Fitting in Settings With High-Dimensional Confounding.

IF 1.8 · CAS Zone 4 (Medicine) · JCR Q3 · MATHEMATICAL & COMPUTATIONAL BIOLOGY

Statistics in Medicine Pub Date : 2025-09-01 DOI:10.1002/sim.70272

Susan Ellul, Stijn Vansteelandt, John B Carlin, Margarita Moreno-Betancur

Statistics in Medicine, Volume 44 (20-22), e70272. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12457817/pdf/

Citations: 0

Abstract

Observational epidemiological studies commonly seek to estimate the causal effect of an exposure on an outcome. Adjustment for potential confounding bias in modern studies is challenging due to the presence of high-dimensional confounding, which occurs when there are many confounders relative to sample size or complex relationships between continuous confounders and exposure and outcome. Doubly robust methods such as Augmented Inverse Probability Weighting (AIPW) and Targeted Maximum Likelihood Estimation (TMLE) have the potential to address these challenges, using data-adaptive approaches and cross-fitting, but despite recent advances, limited evaluation and guidance are available on their implementation in realistic settings where high-dimensional confounding is present. Motivated by an early-life cohort study, we conducted an extensive simulation study to compare the relative performance of AIPW and TMLE using data-adaptive approaches for estimating the average causal effect (ACE). We evaluated the benefits of using cross-fitting with a varying number of folds, as well as the impact of using a reduced versus full (larger, more diverse) library in the Super Learner ensemble learning approach used for implementation. We found that AIPW and TMLE performed similarly in most cases for estimating the ACE, but TMLE was more stable. Cross-fitting improved the performance of both methods, but was more important for variance estimation and coverage than for point estimates, with the number of folds a less important consideration. Using a full Super Learner library was important to reduce bias and variance in complex scenarios typical of modern health research studies.
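The cross-fitting idea described in the abstract can be sketched in a few lines. The following is a minimal illustration of a cross-fitted AIPW estimate of the ACE on simulated data, not the authors' implementation. As a loudly stated assumption, plain logistic and linear regressions (written with NumPy only) stand in for the Super Learner ensemble used in the paper, and every function name, the simulated data-generating process, and the propensity clipping threshold are hypothetical choices made for this example.

```python
import numpy as np

def _design(X):
    # Add an intercept column to the confounder matrix.
    return np.column_stack([np.ones(len(X)), X])

def fit_logistic(X, a, n_iter=25, ridge=1e-6):
    # Newton-Raphson for logistic regression (tiny ridge for numerical stability).
    Xd = _design(X)
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        hess = Xd.T @ (Xd * (p * (1.0 - p))[:, None]) + ridge * np.eye(Xd.shape[1])
        beta = beta + np.linalg.solve(hess, Xd.T @ (a - p) - ridge * beta)
    return beta

def predict_logistic(beta, X):
    return 1.0 / (1.0 + np.exp(-_design(X) @ beta))

def fit_linear(X, y):
    # Ordinary least squares for the outcome regression.
    return np.linalg.lstsq(_design(X), y, rcond=None)[0]

def predict_linear(beta, X):
    return _design(X) @ beta

def cross_fitted_aipw(X, a, y, n_folds=5, seed=0, clip=0.01):
    # Assumption: simple parametric learners replace the Super Learner here.
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = rng.permutation(n) % n_folds  # random, near-equal fold assignment
    phi = np.empty(n)                     # per-observation influence-function values
    for k in range(n_folds):
        test = folds == k
        train = ~test
        Xtr, atr, ytr = X[train], a[train], y[train]
        # Nuisance models are fitted on the training folds only ...
        ps_beta = fit_logistic(Xtr, atr)
        m1_beta = fit_linear(Xtr[atr == 1], ytr[atr == 1])
        m0_beta = fit_linear(Xtr[atr == 0], ytr[atr == 0])
        # ... and evaluated on the held-out fold.
        g = np.clip(predict_logistic(ps_beta, X[test]), clip, 1.0 - clip)
        m1 = predict_linear(m1_beta, X[test])
        m0 = predict_linear(m0_beta, X[test])
        at, yt = a[test], y[test]
        phi[test] = m1 - m0 + at * (yt - m1) / g - (1 - at) * (yt - m0) / (1.0 - g)
    ace = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(n)     # influence-function-based standard error
    return ace, se

# Hypothetical simulated data with a true ACE of 2.0.
rng = np.random.default_rng(42)
n = 4000
X = rng.normal(size=(n, 3))
a = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1]))))
y = 2.0 * a + X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)

ace, se = cross_fitted_aipw(X, a, y, n_folds=5)
print(f"ACE estimate: {ace:.3f}, 95% CI: ({ace - 1.96 * se:.3f}, {ace + 1.96 * se:.3f})")
```

Changing `n_folds` in this sketch gives a rough feel for the abstract's point that the number of folds matters less than using cross-fitting at all. The correctly specified parametric models used here do not reproduce the high-dimensional confounding scenarios the paper studies.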

Source journal

Statistics in Medicine (Medicine: Public, Environmental & Occupational Health)

CiteScore: 3.40

Self-citation rate: 10.00%

Articles published: 334

Review time: 2-4 weeks

Journal description: The journal aims to influence practice in medicine and its associated sciences through the publication of papers on statistical and other quantitative methods. Papers will explain new methods and demonstrate their application, preferably through a substantive, real, motivating example or a comprehensive evaluation based on an illustrative example. Alternatively, papers will report on case studies where creative use or technical generalizations of established methodology are directed towards a substantive application. Reviews of, and tutorials on, general topics relevant to the application of statistics to medicine will also be published. The main criteria for publication are appropriateness of the statistical methods to a particular medical problem and clarity of exposition. Papers with primarily mathematical content will be excluded. The journal aims to enhance communication between statisticians, clinicians and medical researchers.