Title: CVtreeMLE: Efficient Estimation of Mixed Exposures using Data Adaptive Decision Trees and Cross-Validated Targeted Maximum Likelihood Estimation in R
Authors: David McCoy, Alan Hubbard, Mark Van der Laan
Journal: Journal of Open Source Software
Published: 2023-01-01
DOI: https://doi.org/10.21105/joss.04181
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10312067/pdf/
Citations: 1
Abstract
Statistical causal inference of mixed exposures has been limited by reliance on parametric models and, until recently, by researchers considering only one exposure at a time, usually estimated as a beta coefficient in a generalized linear regression model (GLM). This independent assessment of exposures poorly estimates the joint impact of a collection of the same exposures in a realistic exposure setting. Marginal methods for mixture variable selection such as ridge/lasso regression are biased by linear assumptions, and the interactions modeled are chosen by the user. Clustering methods such as principal component regression lose both interpretability and valid inference. Newer mixture methods such as quantile g-computation (Keil et al., 2020) are biased by linear/additive assumptions. More flexible methods such as Bayesian kernel machine regression (BKMR) (Bobb et al., 2014) are sensitive to the choice of tuning parameters, are computationally taxing, and lack an interpretable and robust summary statistic of dose-response relationships. No method currently exists that finds the best flexible model to adjust for covariates while applying a non-parametric model that targets interactions in a mixture and delivers valid inference for a target parameter. Non-parametric methods such as decision trees are a useful tool for evaluating combined exposures: they find partitions in the joint-exposure (mixture) space that best explain the variance in an outcome. However, current methods that use decision trees to draw statistical inference about interactions are biased and prone to overfitting, because they use the full data both to identify nodes in the tree and to make statistical inference given these nodes. Other methods derive inference from an independent test set and therefore do not use the full data.
The CVtreeMLE R package provides researchers in (bio)statistics, epidemiology, and environmental health sciences with access to state-of-the-art statistical methodology for evaluating the causal effects of a data-adaptively determined mixed exposure using decision trees. Our target audience is analysts who would normally use a potentially biased GLM-based model for a mixed exposure. Instead, we hope to provide users with a non-parametric statistical machine: users simply specify the exposures, covariates, and outcome, and CVtreeMLE then determines whether a best-fitting decision tree exists and delivers interpretable results.
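The sample-splitting idea in the abstract can be illustrated with a toy sketch. This is not the CVtreeMLE implementation or its API. It is written in Python rather than R, uses a one-split decision stump in place of a full decision tree, a plain difference in means in place of targeted maximum likelihood estimation, and ignores covariates. All function names are hypothetical. It only shows a partition being discovered on the training folds and its effect being estimated on the held-out fold, so that every observation contributes to an estimate without being used to choose the rule it is evaluated under.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the group mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_threshold(rows):
    """Pick the exposure cut point that best explains outcome variance.

    A one-split stand-in for a decision tree (an assumption of this sketch).
    `rows` is a list of (exposure, outcome) pairs.
    """
    best_cut, best_score = None, float("inf")
    for cut in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= cut]
        right = [y for x, y in rows if x > cut]
        if not left or not right:
            continue
        score = sse(left) + sse(right)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut


def cross_fit_effect(rows, k=5, seed=0):
    """Find the rule on k-1 folds, estimate its effect on the held-out fold."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    estimates = []
    for i, valid in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        cut = best_threshold(train)
        if cut is None:
            continue
        above = [y for x, y in valid if x > cut]
        below = [y for x, y in valid if x <= cut]
        if above and below:
            # Unadjusted mean difference: a placeholder for a TMLE update.
            estimates.append((cut, mean(above) - mean(below)))
    return estimates


def simulate(n=200, seed=1):
    """Toy data: the outcome jumps by 2 when the exposure exceeds 0.6."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 1)
        y = (2.0 if x > 0.6 else 0.0) + rng.gauss(0, 0.5)
        rows.append((x, y))
    return rows


if __name__ == "__main__":
    for cut, diff in cross_fit_effect(simulate()):
        print(f"cut={cut:.3f}  held-out mean difference={diff:.3f}")
```

Averaging the fold-specific differences gives a pooled estimate that uses all observations. A real analysis would also need covariate adjustment and a valid variance estimate, which this sketch omits.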