PyamilySeq: A Python Tool for Interpretable Gene (Re)Clustering and Pangenomic Inference Across Species and Genera
Nicholas J. Dimonaco
arXiv - QuanBio - Genomics, published 2024-07-27
DOI: arxiv-2407.19328 (https://doi.org/arxiv-2407.19328)
Abstract
PyamilySeq is a Python-based tool designed for interpretable gene clustering
and pangenomic inference, supporting analyses at both species and genus levels.
It clusters gene sequences into families by sequence similarity using
CD-HIT, and can also take the output of tried-and-tested sequence
clustering tools such as CD-HIT, BLAST, DIAMOND, and MMseqs2. PyamilySeq is
distinctive in its ability to integrate new sequences into existing clusters,
providing a robust framework for iterative analysis while preserving the
original clusters, useful when reannotating genomes. In addition to the
standard Species mode, which, as with other tools, performs core-gene analysis
across a species range, PyamilySeq can be run in Genus mode where it detects
the presence of gene families shared across multiple genera. These features
enhance the tool's applicability for ongoing and past genomic studies and
comparative analyses. PyamilySeq generates comprehensive outputs, including
gene presence-absence matrices and aligned sequence data, enabling downstream
analysis and interpretation of the identified gene groups and pangenomic data.
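
To make the idea of a gene presence-absence matrix concrete, the following is a minimal sketch of how CD-HIT cluster output could be turned into such a matrix and screened for core gene families. It is not PyamilySeq's implementation. The function names, the "genome|gene" sequence ID convention, the example data, and the 99% core threshold are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical CD-HIT .clstr excerpt. The "genome|gene" ID convention is an
# assumption for this sketch, not something stated in the abstract.
CLSTR_TEXT = (
    ">Cluster 0\n"
    "0\t300aa, >genomeA|gene1... *\n"
    "1\t298aa, >genomeB|gene7... at 98.50%\n"
    "2\t301aa, >genomeC|gene3... at 97.10%\n"
    ">Cluster 1\n"
    "0\t150aa, >genomeA|gene2... *\n"
    "1\t150aa, >genomeB|gene9... at 99.00%\n"
    ">Cluster 2\n"
    "0\t210aa, >genomeC|gene5... *\n"
)

# A member line looks like: "0<TAB>300aa, >seq_id... *"
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(text):
    """Return {cluster_name: [sequence_id, ...]} from CD-HIT .clstr text."""
    clusters = defaultdict(list)
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:].replace(" ", "_")
        elif current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


def presence_absence(clusters, sep="|"):
    """Build a 0/1 row per gene family, with one column per genome."""
    genomes = sorted(
        {seq.split(sep, 1)[0] for members in clusters.values() for seq in members}
    )
    matrix = {}
    for name, members in clusters.items():
        present = {seq.split(sep, 1)[0] for seq in members}
        matrix[name] = [1 if genome in present else 0 for genome in genomes]
    return genomes, matrix


def core_families(matrix, n_genomes, threshold=0.99):
    """Families present in at least `threshold` of genomes (assumed cutoff)."""
    return [
        name for name, row in matrix.items() if sum(row) >= threshold * n_genomes
    ]


clusters = parse_clstr(CLSTR_TEXT)
genomes, matrix = presence_absence(clusters)
core = core_families(matrix, len(genomes))
```

In this toy input only the first family contains a sequence from every genome, so it is the only one that passes the assumed core threshold. A Genus-mode analysis could in principle use the same matrix with genera in place of genomes as columns, though the abstract does not describe how PyamilySeq does this.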