randPedPCA: rapid approximation of principal components from large pedigrees

IF 3.1 1区 农林科学 Q1 AGRICULTURE, DAIRY & ANIMAL SCIENCE
Hanbin Lee, Rosalind Françoise Craddock, Gregor Gorjanc, Hannes Becher
{"title":"randPedPCA: rapid approximation of principal components from large pedigrees","authors":"Hanbin Lee, Rosalind Françoise Craddock, Gregor Gorjanc, Hannes Becher","doi":"10.1186/s12711-025-00994-y","DOIUrl":null,"url":null,"abstract":"Pedigrees continue to be extremely important in agriculture and conservation genetics, with the pedigrees of modern breeding programmes easily comprising millions of records. This size can make visualising the structure of such pedigrees challenging. Being graphs, pedigrees can be represented as matrices, including, most commonly, the additive (numerator) relationship matrix, $$\\varvec{A}$$ , and its inverse. With these matrices, the structure of pedigrees can then, in principle, be visualised via principal component analysis (PCA). However, the naive PCA of matrices for large pedigrees is challenging due to computational and memory constraints. Furthermore, computing a few leading principal components is usually sufficient for visualising the structure of a pedigree. We present the open-access R package randPedPCA for rapid pedigree PCA using sparse matrices. Our rapid pedigree PCA builds on the fact that matrix-vector multiplications with the additive relationship matrix can be carried out implicitly using the extremely sparse inverse relationship factor, $$\\varvec{L}^{-1}$$ , which can be directly obtained from a given pedigree. We implemented two methods. Randomised singular value decomposition tends to be faster when very few principal components are requested, and Eigen decomposition via the RSpectra library tends to be faster when more principal components are of interest. On simulated data, our package delivers a speed-up greater than 10,000 times compared to naive PCA. It further enables analyses that are impossible with naive PCA. When only two principal components are desired, the randomised PCA method can half the running time required compared to RSpectra, which we demonstrate by analysing the pedigree of the UK Kennel Club registered Labrador Retriever population of almost 1.5 million individuals. The leading principal components of pedigree matrices can be efficiently obtained using randomised singular value decomposition and other methods. Scatter plots of these scores allow for intuitive visualisation of large pedigrees. For large pedigrees, this is considerably faster than rendering plots of a pedigree graph.","PeriodicalId":55120,"journal":{"name":"Genetics Selection Evolution","volume":"178 1","pages":""},"PeriodicalIF":3.1000,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Genetics Selection Evolution","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s12711-025-00994-y","RegionNum":1,"RegionCategory":"农林科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AGRICULTURE, DAIRY & ANIMAL SCIENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Pedigrees continue to be extremely important in agriculture and conservation genetics, with the pedigrees of modern breeding programmes easily comprising millions of records. This size can make visualising the structure of such pedigrees challenging. Being graphs, pedigrees can be represented as matrices, including, most commonly, the additive (numerator) relationship matrix, $$\varvec{A}$$ , and its inverse. With these matrices, the structure of pedigrees can then, in principle, be visualised via principal component analysis (PCA). However, the naive PCA of matrices for large pedigrees is challenging due to computational and memory constraints. Furthermore, computing a few leading principal components is usually sufficient for visualising the structure of a pedigree. We present the open-access R package randPedPCA for rapid pedigree PCA using sparse matrices. Our rapid pedigree PCA builds on the fact that matrix-vector multiplications with the additive relationship matrix can be carried out implicitly using the extremely sparse inverse relationship factor, $$\varvec{L}^{-1}$$ , which can be directly obtained from a given pedigree. We implemented two methods. Randomised singular value decomposition tends to be faster when very few principal components are requested, and Eigen decomposition via the RSpectra library tends to be faster when more principal components are of interest. On simulated data, our package delivers a speed-up greater than 10,000 times compared to naive PCA. It further enables analyses that are impossible with naive PCA. When only two principal components are desired, the randomised PCA method can half the running time required compared to RSpectra, which we demonstrate by analysing the pedigree of the UK Kennel Club registered Labrador Retriever population of almost 1.5 million individuals. The leading principal components of pedigree matrices can be efficiently obtained using randomised singular value decomposition and other methods. Scatter plots of these scores allow for intuitive visualisation of large pedigrees. For large pedigrees, this is considerably faster than rendering plots of a pedigree graph.
randPedPCA:快速逼近大型谱系的主成分
系谱在农业和保护遗传学中仍然是极其重要的,现代育种计划的系谱很容易包含数百万条记录。这种大小可以使这种谱系的结构可视化具有挑战性。作为图,谱系可以表示为矩阵,包括最常见的加性(分子)关系矩阵$$\varvec{A}$$及其逆矩阵。有了这些矩阵,谱系的结构原则上可以通过主成分分析(PCA)可视化。然而,由于计算和内存的限制,大型谱系矩阵的朴素PCA具有挑战性。此外,计算几个主要成分通常足以可视化谱系的结构。我们提出了一个开放存取的R包randPedPCA,用于使用稀疏矩阵的快速系谱PCA。我们的快速系谱PCA建立在这样一个事实之上,即与可加关系矩阵的矩阵向量乘法可以隐式地使用极其稀疏的逆关系因子$$\varvec{L}^{-1}$$进行,该因子可以直接从给定的系谱中获得。我们实现了两个方法。当需要很少的主成分时,随机奇异值分解往往更快,而当需要更多的主成分时,通过RSpectra库进行特征分解往往更快。在模拟数据上,与原始PCA相比,我们的包提供了超过10,000倍的加速。它进一步实现了原始PCA无法实现的分析。当只需要两个主成分时,随机PCA方法与RSpectra相比可以减少一半的运行时间,我们通过分析英国犬科俱乐部注册的拉布拉多寻回犬种群的近150万只个体的血统来证明这一点。利用随机奇异值分解等方法可以有效地求出系谱矩阵的主成分。这些分数的散点图允许直观地可视化大型谱系。对于大型系谱,这比绘制系谱图要快得多。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Genetics Selection Evolution
Genetics Selection Evolution 生物-奶制品与动物科学
CiteScore
6.50
自引率
9.80%
发文量
74
审稿时长
1 months
期刊介绍: Genetics Selection Evolution invites basic, applied and methodological content that will aid the current understanding and the utilization of genetic variability in domestic animal species. Although the focus is on domestic animal species, research on other species is invited if it contributes to the understanding of the use of genetic variability in domestic animals. Genetics Selection Evolution publishes results from all levels of study, from the gene to the quantitative trait, from the individual to the population, the breed or the species. Contributions concerning both the biological approach, from molecular genetics to quantitative genetics, as well as the mathematical approach, from population genetics to statistics, are welcome. Specific areas of interest include but are not limited to: gene and QTL identification, mapping and characterization, analysis of new phenotypes, high-throughput SNP data analysis, functional genomics, cytogenetics, genetic diversity of populations and breeds, genetic evaluation, applied and experimental selection, genomic selection, selection efficiency, and statistical methodology for the genetic analysis of phenotypes with quantitative and mixed inheritance.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信