DAPCy: a Python package for the discriminant analysis of principal components method for population genetic analyses.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics Advances, volume 5, issue 1, article vbaf143. Pub Date: 2025-06-18. eCollection Date: 2025-01-01. DOI: 10.1093/bioadv/vbaf143

Alejandro Correa Rojo, Pieter Moris, Hanne Meuwissen, Pieter Monsieurs, Dirk Valkenborg



Abstract

Summary: The Discriminant Analysis of Principal Components (DAPC) method is a pivotal tool in population genetics. It combines principal component analysis and linear discriminant analysis to assess the genetic structure of populations from genetic markers, with a focus on describing the variation between genetic clusters. Despite its utility, the original R implementation in the adegenet package faces computational challenges with large genomic datasets. To address these limitations, we introduce DAPCy, a Python package that builds on the scikit-learn library to improve the method's scalability and efficiency. DAPCy supports large datasets by using compressed sparse matrices and truncated singular value decomposition for dimensionality reduction, coupled with training-test cross-validation for robust model evaluation. It also includes modules for de novo genetic clustering and extensive visualization and reporting capabilities. Compared to the original R implementation, DAPCy can process genomic datasets with thousands of samples and features in less computational time and with reduced memory usage. To demonstrate DAPCy's computational capabilities, we benchmarked it against the R implementation using the Plasmodium falciparum dataset from MalariaGEN and data from the 1000 Genomes Project.
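
The workflow described in the summary (sparse genotype storage, truncated SVD, linear discriminant analysis, and a training-test split) can be sketched with plain scikit-learn and SciPy. This is a minimal illustration under stated assumptions, not DAPCy's own API: the synthetic genotype matrix, the three simulated populations, the choice of 20 components, and all variable names are hypothetical and chosen only for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Assumption: synthetic data standing in for real genotypes.
# 3 populations x 100 samples, 2000 biallelic SNPs coded as 0/1/2 allele counts.
n_pops, n_per_pop, n_snps = 3, 100, 2000
allele_freqs = rng.uniform(0.05, 0.5, size=(n_pops, n_snps))
genotypes = np.vstack(
    [rng.binomial(2, allele_freqs[p], size=(n_per_pop, n_snps)) for p in range(n_pops)]
)
labels = np.repeat(np.arange(n_pops), n_per_pop)

# Store the genotype matrix in compressed sparse row format;
# zero allele counts are not stored explicitly.
X = sparse.csr_matrix(genotypes.astype(np.float32))

# Hold out 30% of the samples for evaluation, stratified by population.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=0
)

# Truncated SVD accepts sparse input directly (no densifying or centering),
# and LDA is then fitted on the reduced components.
model = make_pipeline(
    TruncatedSVD(n_components=20, random_state=0),
    LinearDiscriminantAnalysis(),
)
model.fit(X_train, y_train)

# Mean classification accuracy on the held-out samples.
accuracy = model.score(X_test, y_test)

# Discriminant-function coordinates (at most n_pops - 1 axes) for plotting.
coords = model.transform(X_test)
print(f"held-out accuracy: {accuracy:.2f}, coordinates shape: {coords.shape}")
```

Truncated SVD is used here rather than classical PCA because it does not require centering the data, which would turn a sparse matrix into a dense one. The number of retained components is a tuning choice that would normally be selected by cross-validation rather than fixed in advance.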

Availability and implementation: DAPCy can be installed as a Python package through pip. Source code is available at https://gitlab.com/uhasselt-bioinfo/dapcy. Documentation and a tutorial can be found at https://uhasselt-bioinfo.gitlab.io/dapcy/.

