Privacy-Enhancing Sequential Learning under Heterogeneous Selection Bias in Multi-Site EHR Data
Ritoban Kundu, Xu Shi, Kumar Kshitij Patel, Lucila Ohno-Machado, Maxwell Salvatore, Peter X K Song, Bhramar Mukherjee
medRxiv: the preprint server for health sciences. Published 2025-09-28.
DOI: https://doi.org/10.1101/2025.09.26.25336642
Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12486029/pdf/
Abstract
Objective: To develop privacy-enhancing statistical methods for estimating the association parameters of binary disease risk models across multiple electronic health record (EHR) sites with heterogeneous selection mechanisms, without sharing raw individual-level data. We illustrate their utility through a cross-biobank analysis of smoking and 97 cancer subtypes using data from the NIH All of Us (AOU) and the Michigan Genomics Initiative (MGI).
Materials and methods: Large-scale biobanks often follow heterogeneous recruitment strategies and store data in separate cloud-based platforms, making centralized algorithms infeasible. To address this, we propose two decentralized sequential estimators, namely Sequential Pseudo-likelihood (SPL) and Sequential Augmented Inverse Probability Weighting (SAIPW), that leverage external population-level information to adjust for selection bias, with valid variance estimation. SAIPW additionally protects against misspecification of the selection model by using flexible machine-learning-based auxiliary outcome models. We compare SPL and SAIPW with the existing Sequential Unweighted (SUW) estimator and with centralized and meta-learning extensions of IPW and AIPW in simulations under both correctly specified and misspecified selection mechanisms. We apply the methods to harmonized data from MGI (n = 50,935) and AOU (n = 241,563) to estimate smoking-cancer associations.
Results: In simulations, SUW exhibited substantial bias and poor coverage. SPL and SAIPW yielded unbiased estimates with valid coverage probabilities under correct model specification, with SAIPW remaining robust under selection model misspecification. Both approaches showed no notable efficiency loss relative to centralized methods. Meta-learning methods were efficient for large sites but failed in settings with small cohorts and rare outcomes. In real-data analysis, strong associations were consistently identified between smoking and cancers of the lung, bladder, and larynx, aligning with established epidemiological evidence.
Conclusion: Our framework enables valid, privacy-enhancing inference across EHR cohorts with heterogeneous selection, supporting scalable, decentralized research using real-world data.
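The general idea behind the sequential estimators described under Materials and methods can be illustrated with a small simulation. The sketch below is not the SPL or SAIPW estimator from the paper: it is a generic inverse-probability-weighted logistic regression updated site by site, in which each site receives only the running coefficient estimate and an accumulated information matrix from the previous site. It assumes the selection probabilities are known exactly (the paper instead estimates selection using external population-level information), and all function names, selection coefficients, and sample sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_score_info(beta, X, y, w):
    """Weighted logistic score vector and information matrix at beta."""
    p = sigmoid(X @ beta)
    score = X.T @ (w * (y - p))
    info = (X * (w * p * (1.0 - p))[:, None]).T @ X
    return score, info

def sequential_update(beta_prev, info_prev, X, y, w, n_iter=50, tol=1e-10):
    """Fold one site's data into the running estimate via Newton steps.

    Only (beta_prev, info_prev) arrive from earlier sites; no raw rows
    are shared. The accumulated matrix is a working curvature term, not
    a valid variance estimate for a weighted analysis.
    """
    beta = beta_prev.copy()
    for _ in range(n_iter):
        score, info = weighted_score_info(beta, X, y, w)
        rhs = info_prev @ (beta_prev - beta) + score
        step = np.linalg.solve(info_prev + info, rhs)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    _, info = weighted_score_info(beta, X, y, w)
    return beta, info_prev + info

def simulate_site(rng, n, beta, sel_coef):
    """Draw a population, then keep rows via an outcome-dependent selection model."""
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    y = rng.binomial(1, sigmoid(X @ beta))
    pi = sigmoid(sel_coef[0] + sel_coef[1] * y + sel_coef[2] * x)
    keep = rng.random(n) < pi
    # Assumption: true selection probabilities are known, so weights are 1/pi.
    return X[keep], y[keep].astype(float), 1.0 / pi[keep]

def run(sites, use_weights):
    """Pass the running summary through the sites in order."""
    beta, info = np.zeros(2), np.zeros((2, 2))
    for X, y, w in sites:
        w_used = w if use_weights else np.ones_like(w)
        beta, info = sequential_update(beta, info, X, y, w_used)
    return beta

rng = np.random.default_rng(0)
true_beta = np.array([-1.0, 0.5])
# Two sites with different (hypothetical) selection mechanisms.
sites = [
    simulate_site(rng, 50_000, true_beta, (-1.0, 1.0, 0.5)),
    simulate_site(rng, 50_000, true_beta, (-2.0, 1.5, -0.5)),
]
beta_ipw = run(sites, use_weights=True)
beta_unweighted = run(sites, use_weights=False)
print("true:      ", true_beta)
print("weighted:  ", beta_ipw)
print("unweighted:", beta_unweighted)
```

In this toy setting both sites oversample cases, so the unweighted sequential fit should show a clearly shifted intercept while the weighted fit stays close to the generating values. Valid variance estimation, estimated selection weights, and the augmentation step that gives SAIPW its robustness are all outside the scope of this sketch.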