Efficacy of federated learning on genomic data: a study on the UK Biobank and the 1000 Genomes Project.

IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Frontiers in Big Data. Pub Date: 2024-02-29. eCollection Date: 2024-01-01. DOI: 10.3389/fdata.2024.1266031

Dmitry Kolobkov, Satyarth Mishra Sharma, Aleksandr Medvedev, Mikhail Lebedev, Egor Kosaretskiy, Ruslan Vakhitov

Volume 7, pages 1266031. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10937521/pdf/


Abstract

Combining training data from multiple sources increases sample size and reduces confounding, leading to more accurate and less biased machine learning models. In healthcare, however, data custodians, who are accountable for minimizing the exposure of sensitive information, often do not allow direct pooling of data. Federated learning offers a promising solution to this problem: it trains a model in a decentralized manner, thus reducing the risk of data leakage. Although federated learning is increasingly used on clinical data, its efficacy on individual-level genomic data has not been studied. This study lays the groundwork for the adoption of federated learning for genomic data by investigating its applicability in two scenarios: phenotype prediction on the UK Biobank data and ancestry prediction on the 1000 Genomes Project data. We show that federated models trained on data split into independent nodes achieve performance close to that of centralized models, even in the presence of significant inter-node heterogeneity. Additionally, we investigate how federated model accuracy is affected by communication frequency and suggest approaches to reduce computational complexity or communication costs.
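For readers unfamiliar with the setup, the following is a minimal sketch of federated averaging, one widely used federated scheme. The abstract does not say which algorithms the study uses, so this is an illustration only and not the authors' method. It assumes a toy one-feature linear regression on synthetic data; every name here (`local_update`, `federated_average`, `train_federated`, `make_node`) and every parameter value is hypothetical. Each node trains on its own data, and only model weights are sent to the aggregator. The `local_steps` parameter stands in for communication frequency: more local steps per round means fewer exchanges for the same amount of training.

```python
import random

def local_update(weights, data, lr, local_steps):
    """Run a few full-batch gradient steps of least-squares regression on one node."""
    w, b = weights
    n = len(data)
    for _ in range(local_steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def federated_average(node_weights, node_sizes):
    """Average the node models, weighting each by its sample count."""
    total = sum(node_sizes)
    w = sum(nw[0] * s for nw, s in zip(node_weights, node_sizes)) / total
    b = sum(nw[1] * s for nw, s in zip(node_weights, node_sizes)) / total
    return w, b

def train_federated(nodes, rounds, local_steps, lr=0.05):
    """Alternate local training and weight averaging; raw data never leaves a node."""
    weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(weights, data, lr, local_steps) for data in nodes]
        weights = federated_average(updates, [len(d) for d in nodes])
    return weights

def make_node(rng, n, x_low, x_high):
    """Synthetic node: y = 2x + 1 plus small noise; the x range differs per node."""
    xs = [rng.uniform(x_low, x_high) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.01)) for x in xs]

# Three nodes with different sizes and input ranges (a crude form of heterogeneity).
rng = random.Random(0)
nodes = [
    make_node(rng, 50, -1.0, 0.0),
    make_node(rng, 30, 0.0, 1.0),
    make_node(rng, 20, -0.5, 0.5),
]
w, b = train_federated(nodes, rounds=200, local_steps=5)
print(f"w={w:.2f}, b={b:.2f}")
```

In this toy setting all nodes share the same underlying relationship, so the averaged model should land close to the generating slope and intercept. Real genomic data would involve far larger models and stronger heterogeneity than this sketch captures.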


