Linear Mixed Modeling of Federated Data When Only the Mean, Covariance, and Sample Size Are Available.

IF 1.8 · CAS Zone 4 (Medicine) · JCR Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Statistics in Medicine Pub Date : 2025-01-15 Epub Date: 2024-12-11 DOI:10.1002/sim.10300

Marie Analiz April Limpoco, Christel Faes, Niel Hens


Citations: 0

Abstract

In medical research, individual-level patient data provide invaluable information, but the patients' right to confidentiality remains of utmost priority. This poses a huge challenge when estimating statistical models such as a linear mixed model, which is an extension of linear regression models that can account for potential heterogeneity whenever data come from different data providers. Federated learning tackles this hurdle by estimating parameters without retrieving individual-level data. Instead, iterative communication of parameter estimate updates between the data providers and analysts is required. In this article, we propose an alternative framework to federated learning for fitting linear mixed models. Specifically, our approach only requires the mean, covariance, and sample size of multiple covariates from different data providers once. Using the principle of statistical sufficiency within the likelihood framework as theoretical support, this proposed strategy achieves estimates identical to those derived from actual individual-level data. We demonstrate this approach through real data on 15 068 patient records from 70 clinics at the Children's Hospital of Pennsylvania. Assuming that each clinic only shares summary statistics once, we model the COVID-19 polymerase chain reaction test cycle threshold as a function of patient information. Simplicity, communication efficiency, generalisability, and wider scope of implementation in any statistical software distinguish our approach from existing strategies in the literature.
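The sufficiency argument in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a random-intercept model with one random effect per data provider, variance components `sigma2` and `tau2` that are treated as known, and at least two records per provider. The function names and the simulated data are hypothetical. The sketch rebuilds each provider's cross-product matrix from its mean, covariance, and sample size, then checks that the generalized least squares estimate of the fixed effects matches the one computed from the individual-level records.

```python
import numpy as np

def summarize(Z):
    """What a provider would share once: mean vector, covariance matrix, sample size."""
    return Z.mean(axis=0), np.cov(Z, rowvar=False, ddof=1), Z.shape[0]

def cross_products(mean, cov, n):
    """Rebuild column sums and Z'Z from the summaries: Z'Z = (n - 1) S + n m m'."""
    return n * mean, (n - 1) * cov + n * np.outer(mean, mean)

def gls_from_summaries(summaries, sigma2, tau2):
    """GLS fixed effects for a random-intercept model, using summaries only.

    Each summary describes columns [x_1, ..., x_k, y]; an intercept is added here.
    """
    p = len(summaries[0][0])  # k covariates + y, so intercept + k covariates = p coefficients
    A, b = np.zeros((p, p)), np.zeros(p)
    for mean, cov, n in summaries:
        s, ZZ = cross_products(mean, cov, n)
        # Cross-products of W = [1, x_1, ..., x_k, y]
        WW = np.block([[np.array([[float(n)]]), s[None, :]],
                       [s[:, None], ZZ]])
        sW = WW[0]  # column sums of W, i.e. 1'W
        # V^{-1} = (I - c 11') / sigma2 for V = sigma2 I + tau2 11'
        c = tau2 / (sigma2 + n * tau2)
        M = (WW - c * np.outer(sW, sW)) / sigma2  # W' V^{-1} W
        A += M[:-1, :-1]   # D' V^{-1} D, with D = [1, x_1, ..., x_k]
        b += M[:-1, -1]    # D' V^{-1} y
    return np.linalg.solve(A, b)

def gls_from_individual(datasets, sigma2, tau2):
    """Reference computation that uses the individual-level records directly."""
    p = datasets[0].shape[1]
    A, b = np.zeros((p, p)), np.zeros(p)
    for Z in datasets:
        n = Z.shape[0]
        D = np.column_stack([np.ones(n), Z[:, :-1]])
        y = Z[:, -1]
        V_inv = np.linalg.inv(sigma2 * np.eye(n) + tau2 * np.ones((n, n)))
        A += D.T @ V_inv @ D
        b += D.T @ V_inv @ y
    return np.linalg.solve(A, b)

# Simulated data for illustration only (not the clinic data from the article).
rng = np.random.default_rng(0)
sigma2, tau2 = 1.0, 0.5
datasets = []
for _ in range(6):
    n = int(rng.integers(5, 40))
    X = rng.normal(size=(n, 2))
    u = rng.normal(scale=np.sqrt(tau2))  # provider-level random intercept
    y = 2.0 + X @ np.array([1.0, -0.5]) + u + rng.normal(scale=np.sqrt(sigma2), size=n)
    datasets.append(np.column_stack([X, y]))

beta_summary = gls_from_summaries([summarize(Z) for Z in datasets], sigma2, tau2)
beta_full = gls_from_individual(datasets, sigma2, tau2)
print(beta_summary)
print(beta_full)
```

In this random-intercept setting the Gaussian log-likelihood depends on each provider's data only through the same cross-products, so the variance components could in principle be estimated from the summaries as well. The sketch fixes them to keep the comparison short.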


Source journal

Statistics in Medicine (Medicine: Public, Environmental & Occupational Health)

CiteScore: 3.40

Self-citation rate: 10.00%

Articles published: 334

Review time: 2-4 weeks

About the journal: The journal aims to influence practice in medicine and its associated sciences through the publication of papers on statistical and other quantitative methods. Papers will explain new methods and demonstrate their application, preferably through a substantive, real, motivating example or a comprehensive evaluation based on an illustrative example. Alternatively, papers will report on case studies where creative use or technical generalizations of established methodology are directed towards a substantive application. Reviews of, and tutorials on, general topics relevant to the application of statistics to medicine will also be published. The main criteria for publication are appropriateness of the statistical methods to a particular medical problem and clarity of exposition. Papers with primarily mathematical content will be excluded. The journal aims to enhance communication between statisticians, clinicians and medical researchers.