DisC²o-HD: Distributed causal inference with covariates shift for analyzing real-world high-dimensional data.

IF 5.2 3区计算机科学 Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research Pub Date : 2025-01-01

Jiayi Tong, Jie Hu, George Hripcsak, Yang Ning, Yong Chen

{"title":"DisC2o-HD: Distributed causal inference with covariates shift for analyzing real-world high-dimensional data.","authors":"Jiayi Tong, Jie Hu, George Hripcsak, Yang Ning, Yong Chen","doi":"","DOIUrl":null,"url":null,"abstract":"High-dimensional healthcare data, such as electronic health records (EHR) data and claims data, present two primary challenges due to the large number of variables and the need to consolidate data from multiple clinical sites. The third key challenge is the potential existence of heterogeneity in terms of covariate shift. In this paper, we propose a distributed learning algorithm accounting for covariate shift to estimate the average treatment effect (ATE) for high-dimensional data, named DisC2o-HD. Leveraging the surrogate likelihood method, our method calibrates the estimates of the propensity score and outcome models to approximately attain the desired covariate balancing property, while accounting for the covariate shift across multiple clinical sites. We show that our distributed covariate balancing propensity score estimator can approximate the pooled estimator, which is obtained by pooling the data from multiple sites together. The proposed estimator remains consistent if either the propensity score model or the outcome regression model is correctly specified. The semiparametric efficiency bound is achieved when both the propensity score and the outcome models are correctly specified. We conduct simulation studies to demonstrate the performance of the proposed algorithm; additionally, we apply the algorithm to a real-world data set to present the readiness of implementation and validity.","PeriodicalId":50161,"journal":{"name":"Journal of Machine Learning Research","volume":"26 ","pages":""},"PeriodicalIF":5.2000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12269483/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Machine Learning Research","FirstCategoryId":"94","ListUrlMain":"","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

High-dimensional healthcare data, such as electronic health records (EHR) data and claims data, present two primary challenges due to the large number of variables and the need to consolidate data from multiple clinical sites. The third key challenge is the potential existence of heterogeneity in terms of covariate shift. In this paper, we propose a distributed learning algorithm accounting for covariate shift to estimate the average treatment effect (ATE) for high-dimensional data, named DisC²o-HD. Leveraging the surrogate likelihood method, our method calibrates the estimates of the propensity score and outcome models to approximately attain the desired covariate balancing property, while accounting for the covariate shift across multiple clinical sites. We show that our distributed covariate balancing propensity score estimator can approximate the pooled estimator, which is obtained by pooling the data from multiple sites together. The proposed estimator remains consistent if either the propensity score model or the outcome regression model is correctly specified. The semiparametric efficiency bound is achieved when both the propensity score and the outcome models are correctly specified. We conduct simulation studies to demonstrate the performance of the proposed algorithm; additionally, we apply the algorithm to a real-world data set to present the readiness of implementation and validity.

Abstract Image

本刊更多论文

用于分析现实世界高维数据的协变量移位的分布式因果推理。

高维医疗保健数据，如电子健康记录（EHR）数据和索赔数据，由于存在大量变量和需要整合来自多个临床站点的数据，带来了两个主要挑战。第三个关键挑战是协变量移位方面异质性的潜在存在。在本文中，我们提出了一个考虑协变量移位的分布式学习算法来估计高维数据的平均处理效果（ATE），命名为disc20 - hd。利用替代似然法，我们的方法校准了倾向评分和结果模型的估计，以近似地达到期望的协变量平衡特性，同时考虑了多个临床地点的协变量转移。我们证明了我们的分布协变量平衡倾向得分估计量可以近似于由多个站点的数据池化而得到的池化估计量。如果倾向得分模型或结果回归模型被正确指定，所提出的估计量保持一致。当倾向得分和结果模型都正确指定时，可以实现半参数效率界。我们进行了仿真研究，以证明所提出算法的性能；此外，我们将算法应用于现实世界的数据集，以展示实现的准备和有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Journal of Machine Learning Research 工程技术-计算机：人工智能

CiteScore

18.80

自引率

0.00%

发文量

审稿时长

3 months

期刊介绍： The Journal of Machine Learning Research (JMLR) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online. JMLR has a commitment to rigorous yet rapid reviewing. JMLR seeks previously unpublished papers on machine learning that contain: new principled algorithms with sound empirical validation, and with justification of theoretical, psychological, or biological nature; experimental and/or theoretical studies yielding new insight into the design and behavior of learning in intelligent systems; accounts of applications of existing techniques that shed light on the strengths and weaknesses of the methods; formalization of new learning tasks (e.g., in the context of new applications) and of methods for assessing performance on those tasks; development of new analytical frameworks that advance theoretical studies of practical learning methods; computational models of data from natural learning systems at the behavioral or neural level; or extremely well-written surveys of existing work.