Marker effect p-values for single-step GWAS with the algorithm for proven and young in large genotyped populations
Authors: Natália Galoro Leite, Matias Bermann, Shogo Tsuruta, Ignacy Misztal, Daniela Lourenco
Journal: Genetics Selection Evolution (impact factor 3.6; JCR Q1, Agriculture, Dairy & Animal Science)
Published: 2024-08-22
DOI: 10.1186/s12711-024-00925-3 (https://doi.org/10.1186/s12711-024-00925-3)
Abstract
Single-nucleotide polymorphism (SNP) effects can be backsolved from ssGBLUP genomic estimated breeding values (GEBV) and used for genome-wide association studies (ssGWAS). However, obtaining p-values for those SNP effects relies on the inversion of dense matrices, which poses computational limitations in large genotyped populations. In this study, we present a method to approximate SNP p-values for ssGWAS with many genotyped animals. The method combines a sparse approximation of the inverse of the genomic relationship matrix ($$\mathbf{G}_{\mathrm{APY}}^{-1}$$), built with the algorithm for proven and young (APY), with an approximation of the prediction error variance of SNP effects that does not require inverting the left-hand side (LHS) of the mixed model equations. To test the proposed p-value computing method, we used a reduced population of 50K genotyped animals and compared the approximated SNP p-values with benchmark p-values obtained from the direct inverse of the LHS built with an exact inverse genomic relationship matrix ($$\mathbf{G}^{-1}$$). We then applied the proposed approximation to obtain SNP p-values for a larger population of 450K genotyped animals. With 50K genotyped animals, all p-value computing methods identified the same genomic regions on chromosomes 7 and 20. In terms of computational requirements, the proposed approximation reduced wall-clock time 38-fold and memory requirements tenfold compared with the exact inversion of the LHS. When the approximation was applied to the population of 450K genotyped animals, two new significant regions on chromosomes 6 and 14 were uncovered, indicating an increase in GWAS detection power when more genotypes are included in the analyses. Obtaining p-values with the approximation and 450K genotyped individuals took 24.5 wall-clock hours and 87.66 GB of memory, costs that are expected to increase linearly with the addition of noncore genotyped individuals. With the proposed method, obtaining p-values for SNP effects in ssGWAS is computationally feasible in large genotyped populations. The computational cost of obtaining p-values in ssGWAS may no longer be a limitation in extensive populations with many genotyped animals.
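The abstract does not give formulas, so the following is only a minimal sketch of the two ingredients it names, under stated assumptions. It uses the commonly published block form of the APY inverse, in which noncore animals are conditioned on core animals and their residual variances are kept as a diagonal matrix, together with a two-sided normal test on the standardized SNP effect, which is an assumed convention rather than something stated above. The toy genotypes, the blending with an identity matrix, the arbitrary choice of core animals, and the names `apy_inverse`, `two_sided_p`, and `var_a_hat` are all hypothetical; dense arrays are used for readability, whereas a real implementation would store the result sparsely.

```python
import math

import numpy as np


def apy_inverse(G, core):
    """Block-form APY inverse of G for a given set of core indices.

    Returned as a dense array for clarity; the noncore-by-noncore
    block is diagonal, which is what makes the structure sparse.
    """
    n = G.shape[0]
    core = np.asarray(core)
    noncore = np.setdiff1d(np.arange(n), core)

    Gcc_inv = np.linalg.inv(G[np.ix_(core, core)])
    Gcn = G[np.ix_(core, noncore)]
    P = Gcc_inv @ Gcn  # core x noncore regression coefficients

    # Diagonal of M_nn: g_ii - g_ic Gcc^{-1} g_ci for each noncore animal i
    m = np.diag(G)[noncore] - np.einsum("ij,ij->j", Gcn, P)

    Ginv = np.zeros((n, n))
    Ginv[np.ix_(core, core)] = Gcc_inv + (P / m) @ P.T
    Ginv[np.ix_(core, noncore)] = -P / m
    Ginv[np.ix_(noncore, core)] = (-P / m).T
    Ginv[noncore, noncore] = 1.0 / m  # sets only the diagonal entries
    return Ginv


def two_sided_p(a_hat, var_a_hat):
    """Two-sided normal p-value for standardized SNP effects (assumed test)."""
    z = np.abs(a_hat) / np.sqrt(var_a_hat)
    # 2 * (1 - Phi(z)) equals erfc(z / sqrt(2))
    return np.vectorize(math.erfc)(z / math.sqrt(2.0))


# Toy data (hypothetical sizes, not the 50K or 450K populations)
rng = np.random.default_rng(0)
n_animals, n_snp, n_core = 60, 500, 20
genotypes = rng.integers(0, 3, size=(n_animals, n_snp)).astype(float)
p = genotypes.mean(axis=0) / 2           # observed allele frequencies
Z = genotypes - 2 * p                    # centered genotypes
G = Z @ Z.T / (2 * np.sum(p * (1 - p)))  # VanRaden-style G
G = 0.95 * G + 0.05 * np.eye(n_animals)  # identity blending, an assumption

core = np.arange(n_core)                 # arbitrary core choice
Ginv_apy = apy_inverse(G, core)

n_noncore = n_animals - n_core
print("nonzeros in APY inverse:", np.count_nonzero(Ginv_apy))
print("expected by structure:", n_core**2 + 2 * n_core * n_noncore + n_noncore)
```

In this block form the number of stored elements is n_core² + 2·n_core·n_noncore + n_noncore, so for a fixed core it grows linearly with the number of noncore animals, which is consistent with the linear scaling the abstract anticipates.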
About the journal:
Genetics Selection Evolution invites basic, applied and methodological content that will aid the current understanding and the utilization of genetic variability in domestic animal species. Although the focus is on domestic animal species, research on other species is invited if it contributes to the understanding of the use of genetic variability in domestic animals. Genetics Selection Evolution publishes results from all levels of study, from the gene to the quantitative trait, from the individual to the population, the breed or the species. Contributions concerning both the biological approach, from molecular genetics to quantitative genetics, as well as the mathematical approach, from population genetics to statistics, are welcome. Specific areas of interest include but are not limited to: gene and QTL identification, mapping and characterization, analysis of new phenotypes, high-throughput SNP data analysis, functional genomics, cytogenetics, genetic diversity of populations and breeds, genetic evaluation, applied and experimental selection, genomic selection, selection efficiency, and statistical methodology for the genetic analysis of phenotypes with quantitative and mixed inheritance.