randPedPCA: rapid approximation of principal components from large pedigrees

IF 3.1 · Zone 1, Agricultural and Forestry Sciences · Q1 AGRICULTURE, DAIRY & ANIMAL SCIENCE

Genetics Selection Evolution · Pub Date: 2025-08-28 · DOI: 10.1186/s12711-025-00994-y

Hanbin Lee, Rosalind Françoise Craddock, Gregor Gorjanc, Hannes Becher



Abstract

Pedigrees remain central to agriculture and conservation genetics, and the pedigrees of modern breeding programmes easily comprise millions of records. At this size, visualising the structure of a pedigree becomes challenging. Being graphs, pedigrees can be represented as matrices, most commonly the additive (numerator) relationship matrix, $\boldsymbol{A}$, and its inverse. With these matrices, the structure of a pedigree can, in principle, be visualised via principal component analysis (PCA). However, naive PCA of the matrices of large pedigrees is limited by computational and memory constraints. Moreover, a few leading principal components are usually sufficient for visualising the structure of a pedigree.

We present the open-access R package randPedPCA for rapid pedigree PCA using sparse matrices. Our rapid pedigree PCA builds on the fact that matrix-vector multiplications with the additive relationship matrix can be carried out implicitly using the extremely sparse inverse relationship factor, $\boldsymbol{L}^{-1}$, which can be obtained directly from a given pedigree. We implemented two methods. Randomised singular value decomposition tends to be faster when very few principal components are requested, and eigendecomposition via the RSpectra library tends to be faster when more principal components are of interest. On simulated data, our package delivers a speed-up of more than 10,000 times compared to naive PCA. It further enables analyses that are impossible with naive PCA. When only two principal components are desired, the randomised PCA method can halve the running time compared to RSpectra, which we demonstrate by analysing the pedigree of the UK Kennel Club registered Labrador Retriever population of almost 1.5 million individuals.

The leading principal components of pedigree matrices can be obtained efficiently using randomised singular value decomposition and other methods. Scatter plots of the resulting scores allow for intuitive visualisation of large pedigrees. For large pedigrees, this is considerably faster than rendering plots of a pedigree graph.
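The implicit multiplication described in the abstract can be illustrated with a minimal sketch. It is not the randPedPCA implementation; it relies on the standard factorisation $\boldsymbol{A} = \boldsymbol{T}\boldsymbol{D}\boldsymbol{T}^{\top}$ with $\boldsymbol{T} = (\boldsymbol{I} - \boldsymbol{P})^{-1}$, so that $\boldsymbol{L}^{-1} = \boldsymbol{D}^{-1/2}(\boldsymbol{I} - \boldsymbol{P})$ has at most three non-zero entries per row. The toy pedigree, the function names, and the assumption that no parent is inbred are all hypothetical choices made for illustration. Plain power iteration stands in for the randomised SVD and RSpectra solvers named in the abstract; it is used here only to show that a leading eigenpair can be found from matrix-vector products alone.

```python
import math

# Hypothetical toy pedigree: (sire, dam) index per individual, None = unknown.
# Individuals are ordered so that parents always precede their offspring.
PEDIGREE = [
    (None, None),  # 0: founder
    (None, None),  # 1: founder
    (0, 1),        # 2
    (0, 1),        # 3: full sib of 2
    (2, None),     # 4: dam unknown
    (3, 4),        # 5
]


def known_parents(i):
    """Indices of the known parents of individual i."""
    return [p for p in PEDIGREE[i] if p is not None]


def mendelian_variance(i):
    """Diagonal entry of D. Assumption: the parents of i are not inbred."""
    return {0: 1.0, 1: 0.75, 2: 0.5}[len(known_parents(i))]


def a_times(v):
    """Return A @ v without forming A, using A = T D T' with T = (I - P)^-1."""
    n = len(PEDIGREE)
    # y = T' v: solve (I - P)' y = v, from youngest to oldest individual.
    y = list(v)
    for i in reversed(range(n)):
        for p in known_parents(i):
            y[p] += 0.5 * y[i]
    # z = D y: scale by the Mendelian sampling variances.
    z = [mendelian_variance(i) * y[i] for i in range(n)]
    # x = T z: solve (I - P) x = z, from oldest to youngest individual.
    x = list(z)
    for i in range(n):
        for p in known_parents(i):
            x[i] += 0.5 * x[p]
    return x


def tabular_a():
    """Dense A from the classical tabular method, used only as a check."""
    n = len(PEDIGREE)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        parents = known_parents(i)
        for j in range(i):
            a[i][j] = a[j][i] = 0.5 * sum(a[j][p] for p in parents)
        sire, dam = PEDIGREE[i]
        both_known = sire is not None and dam is not None
        a[i][i] = 1.0 + (0.5 * a[sire][dam] if both_known else 0.0)
    return a


def leading_eigenpair(n_iter=200):
    """Power iteration on A that only ever calls a_times."""
    v = [1.0] * len(PEDIGREE)
    for _ in range(n_iter):
        w = a_times(v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, a_times(v)))
    return eigenvalue, v
```

Each call to `a_times` touches every individual a fixed number of times, so its cost grows linearly with the number of pedigree records, whereas `tabular_a` needs memory quadratic in that number. Multiplying unit vectors with `a_times` reproduces the columns of `tabular_a()`, which gives a simple correctness check on small pedigrees.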




Source journal

Genetics Selection Evolution (Biology: Dairy & Animal Science)

CiteScore

6.50

Self-citation rate

9.80%

Articles published

Review time

1 month

Journal description: Genetics Selection Evolution invites basic, applied and methodological content that will aid the current understanding and the utilization of genetic variability in domestic animal species. Although the focus is on domestic animal species, research on other species is invited if it contributes to the understanding of the use of genetic variability in domestic animals. Genetics Selection Evolution publishes results from all levels of study, from the gene to the quantitative trait, from the individual to the population, the breed or the species. Contributions concerning both the biological approach, from molecular genetics to quantitative genetics, and the mathematical approach, from population genetics to statistics, are welcome. Specific areas of interest include but are not limited to: gene and QTL identification, mapping and characterization, analysis of new phenotypes, high-throughput SNP data analysis, functional genomics, cytogenetics, genetic diversity of populations and breeds, genetic evaluation, applied and experimental selection, genomic selection, selection efficiency, and statistical methodology for the genetic analysis of phenotypes with quantitative and mixed inheritance.