Federated Learning for the pathogenicity annotation of genetic variants in multi-site clinical settings.

Impact factor: 5.4

Bioinformatics (Oxford, England). Pub Date: 2025-09-19. DOI: 10.1093/bioinformatics/btaf523

Nigreisy Montalvo, Francisco Requena, Emidio Capriotti, Antonio Rausell


Citation count: 0

Abstract



Federated Learning for the pathogenicity annotation of genetic variants in multi-site clinical settings.

Motivation: Rare diseases collectively affect 5% of the population. However, fewer than 50% of rare disease patients receive a molecular diagnosis after whole genome sequencing. Supervised machine learning is a valuable approach for the pathogenicity scoring of human genetic variants. However, existing methods are often trained on curated but limited central repositories, resulting in poor accuracy when tested on external cohorts. Meanwhile, large collections of variants generated at hospitals and research institutions remain inaccessible for machine-learning purposes because of privacy and legal constraints. Federated learning (FL) algorithms have recently been developed, enabling institutions to collaboratively train models without sharing their local datasets.

Results: Here, we present a proof-of-concept study evaluating the effectiveness of federated learning for the clinical classification of genetic variants. A comprehensive array of diverse FL strategies was assessed for coding and non-coding single nucleotide variants as well as copy number variants. Our results showed that federated models generally achieved comparable or superior performance to traditional centralized learning. In addition, federated models generalized robustly to independent sets using smaller data fractions than their centralized counterparts. Our findings support the adoption of FL to establish secure multi-institutional collaborations in human variant interpretation.

Availability: All source code required to reproduce the results presented in this manuscript, implemented in Python, is available under the GNU General Public License v3 at https://github.com/RausellLab/FedLearnVar.

Supplementary information: Supplementary data are available at Bioinformatics online.
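
Illustrative sketch: the abstract does not state which FL strategies were evaluated, and the authors' actual code is in the repository linked above. As a rough illustration of the general idea only (institutions train locally and share model weights, never their data), the following minimal example implements federated averaging (FedAvg), one widely used FL strategy, for a toy logistic-regression classifier. All function names, hyperparameters, and the synthetic data are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    # Predicted probability of the positive class for one feature vector.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def local_update(weights, data, lr=0.1, epochs=5):
    # A few epochs of SGD on one site's private data; only weights leave the site.
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            error = predict(w, x) - y
            for j, xj in enumerate(x):
                w[j] -= lr * error * xj
    return w


def federated_average(site_weights, site_sizes):
    # FedAvg aggregation: average site models weighted by local sample count.
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [
        sum(w[j] * n for w, n in zip(site_weights, site_sizes)) / total
        for j in range(dim)
    ]


def train_federated(sites, dim, rounds=20):
    # Each round: broadcast the global model, train locally, aggregate.
    global_w = [0.0] * dim
    for _ in range(rounds):
        updates = [local_update(global_w, data) for data in sites]
        global_w = federated_average(updates, [len(d) for d in sites])
    return global_w


def make_site(n, rng):
    # Synthetic toy data (hypothetical): label is 1 when feature a exceeds b.
    # The trailing 1.0 acts as a bias term.
    data = []
    for _ in range(n):
        a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(([a, b, 1.0], 1 if a > b else 0))
    return data


def accuracy(weights, data):
    # Fraction of examples classified correctly at a 0.5 threshold.
    hits = sum((predict(weights, x) >= 0.5) == (y == 1) for x, y in data)
    return hits / len(data)


rng = random.Random(0)
sites = [make_site(n, rng) for n in (40, 80, 120)]  # three simulated institutions
global_w = train_federated(sites, dim=3)
test_set = make_site(200, rng)  # held-out set standing in for an external cohort
print("held-out accuracy:", accuracy(global_w, test_set))
```

Weighting each site's model by its sample count keeps small sites from dominating the global model. Real clinical deployments would typically add safeguards such as secure aggregation, which this sketch omits.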

