Fast Analysis of Biobank-Size Data and Meta-Analysis using the BGLR R-package
Paulino Pérez-Rodríguez, Gustavo de Los Campos, Hao Wu, Ana I Vazquez, Kyle Jones
G3: Genes|Genomes|Genetics. Published 2024-12-09. DOI: 10.1093/g3journal/jkae288 (https://doi.org/10.1093/g3journal/jkae288)
Abstract
Analyzing human genomic data from biobanks and large-scale genetic evaluations often requires fitting models in which the sample size exceeds the number of DNA markers used (n > p). For instance, developing Polygenic Scores (PGS) for humans and genomic prediction for genetic evaluations of agricultural species may require fitting models involving a few thousand SNPs using data with hundreds of thousands of samples. In such cases, computations based on sufficient statistics are more efficient than those based on individual genotype-phenotype data. Additionally, software that admits sufficient statistics as inputs can be used to analyze data from multiple sources jointly without the need to share individual genotype-phenotype data. Therefore, we developed functionality within the BGLR R-package that generates posterior samples for Bayesian shrinkage and variable selection models from sufficient statistics. In this article, we present an overview of the new methods incorporated in the BGLR R-package, demonstrate the use of the new software through simple examples, and provide several computational benchmarks. We also present a real-data example using data from the UK-Biobank, All of Us, and the HCHS/SOL cohort. The example demonstrates how a joint analysis of multiple cohorts can be implemented without sharing individual genotype-phenotype data, and how a combined analysis can improve the prediction accuracy of PGS for Hispanics, a group severely underrepresented in GWAS data.
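To make the sufficient-statistics idea concrete, the following is a minimal sketch in Python and is not the BGLR implementation or its API. It assumes a linear model without an intercept and uses ridge regression with a fixed penalty `lam` as a simple stand-in for the Bayesian shrinkage models described in the abstract. All function names and the simulated data are hypothetical. The sketch shows that each cohort only needs to share X'X, X'y, y'y, and n, and that these quantities add up across cohorts to give the same estimate as pooling the individual records.

```python
import random

def sufficient_stats(X, y):
    """Return (X'X, X'y, y'y, n) for one cohort; X is a list of rows."""
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    yty = sum(yi * yi for yi in y)
    return xtx, xty, yty, len(y)

def combine(stats_a, stats_b):
    """Sufficient statistics are additive across disjoint cohorts."""
    xtx_a, xty_a, yty_a, n_a = stats_a
    xtx_b, xty_b, yty_b, n_b = stats_b
    xtx = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(xtx_a, xtx_b)]
    xty = [a + b for a, b in zip(xty_a, xty_b)]
    return xtx, xty, yty_a + yty_b, n_a + n_b

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge_from_stats(stats, lam):
    """Ridge estimate (X'X + lam * I)^(-1) X'y; cost depends on p, not n."""
    xtx, xty, _, _ = stats
    p = len(xty)
    A = [[xtx[j][k] + (lam if j == k else 0.0) for k in range(p)] for j in range(p)]
    return solve(A, xty)

def simulate_cohort(n, beta, rng):
    """Simulate genotypes coded 0/1/2 and a phenotype with unit-variance noise."""
    X = [[float(rng.choice((0, 1, 2))) for _ in beta] for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0, 1) for row in X]
    return X, y

rng = random.Random(1)
beta = [0.5, -0.3, 0.0, 0.2]  # hypothetical true marker effects
Xa, ya = simulate_cohort(300, beta, rng)  # cohort A keeps its rows private
Xb, yb = simulate_cohort(200, beta, rng)  # cohort B keeps its rows private

# Only the summaries leave each cohort; they are then added together.
pooled = combine(sufficient_stats(Xa, ya), sufficient_stats(Xb, yb))
est_from_stats = ridge_from_stats(pooled, lam=10.0)

# Reference: the same estimate computed from the pooled individual records.
est_from_rows = ridge_from_stats(sufficient_stats(Xa + Xb, ya + yb), lam=10.0)
```

Once the summaries are formed, the fitting step works on a p × p matrix and a length-p vector, so its cost no longer grows with the number of samples. This is the source of the efficiency gain when n > p.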
About the journal:
G3: Genes, Genomes, Genetics provides a forum for the publication of high‐quality foundational research, particularly research that generates useful genetic and genomic information such as genome maps, single gene studies, genome‐wide association and QTL studies, as well as genome reports, mutant screens, and advances in methods and technology. The Editorial Board of G3 believes that rapid dissemination of these data is the necessary foundation for analysis that leads to mechanistic insights.
G3, published by the Genetics Society of America, meets the critical and growing need of the genetics community for rapid review and publication of important results in all areas of genetics. G3 offers the opportunity to publish the puzzling finding or to present unpublished results that may not have been submitted for review and publication due to a perceived lack of a potential high-impact finding. G3 has earned the DOAJ Seal, which is a mark of certification for open access journals, awarded by DOAJ to journals that achieve a high level of openness, adhere to Best Practice and high publishing standards.