Approximate search for known gene clusters in new genomes using PQ-trees
Galia R Zimerman, Dina Svetlitsky, Meirav Zehavi, Michal Ziv-Ukelson
Algorithms for Molecular Biology 16(1):16. Published 2021-07-09. DOI: https://doi.org/10.1186/s13015-021-00190-9
Citations: 1
Abstract
Gene clusters are groups of genes that are co-locally conserved across various genomes, not necessarily in the same order. Their discovery and analysis are valuable in tasks such as gene annotation and prediction of gene interactions, and in the study of genome organization and evolution. The discovery of conserved gene clusters in a given set of genomes is a well-studied problem, but the rapid sequencing of prokaryotic genomes raises a new one: given a known gene cluster that was discovered and studied in one genomic dataset, identify all instances of that gene cluster in a given new genomic sequence. Thus, we define a new problem in comparative genomics, denoted PQ-TREE SEARCH, that takes as input a PQ-tree T representing the known gene orders of a gene cluster of interest, a gene-to-gene substitution scoring function h, integer arguments [Formula: see text] and [Formula: see text], and a new sequence of genes S. The objective is to identify in S approximate new instances of the gene cluster. These instances may vary from the known gene orders by genome rearrangements that are constrained by T, by gene substitutions that are governed by h, and by gene deletions and insertions that are bounded from above by [Formula: see text] and [Formula: see text], respectively. We prove that PQ-TREE SEARCH is NP-hard and propose a parameterized algorithm that solves the optimization variant of PQ-TREE SEARCH in [Formula: see text] time, where [Formula: see text] is the maximum degree of a node in T and [Formula: see text] is used to hide factors polynomial in the input size. The algorithm is implemented as a search tool, denoted PQFinder, and applied to search for instances of chromosomal gene clusters in plasmids, within a dataset of 1,487 prokaryotic genomes. We report on 29 chromosomal gene clusters that are rearranged in plasmids, where the rearrangements are guided by the corresponding PQ-trees. One of these results, coding for a heavy metal efflux pump, is further analysed to exemplify how PQFinder can be harnessed to reveal interesting new structural variants of known gene clusters.
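To make the role of the PQ-tree concrete, the following is a minimal Python sketch of how a PQ-tree constrains gene order. It relies on the standard PQ-tree semantics: the children of a P-node may appear in any order, while the children of a Q-node may appear only in the given order or its reverse. The tuple-based tree encoding and the function names `frontiers` and `exact_instances` are illustrative assumptions, not part of PQFinder. The sketch enumerates all allowed orders by brute force and reports exact matches only; it ignores substitutions, deletions and insertions, and it is not the parameterized algorithm described in the abstract.

```python
from itertools import permutations

# Hypothetical encoding: a leaf is a gene name (str); an internal node is
# a pair (kind, children) with kind "P" or "Q".

def frontiers(node):
    """Yield every gene order (as a tuple) that the PQ-tree rooted at node allows."""
    if isinstance(node, str):
        yield (node,)
        return
    kind, children = node
    if kind == "P":
        # P-node: children may be arranged in any order.
        orders = permutations(children)
    else:
        # Q-node: children keep their order or are reversed as a whole.
        orders = [tuple(children), tuple(reversed(children))]
    for order in orders:
        yield from _concat(order)

def _concat(children):
    """Yield all concatenations of one frontier per child, left to right."""
    if not children:
        yield ()
        return
    for head in frontiers(children[0]):
        for tail in _concat(children[1:]):
            yield head + tail

def exact_instances(tree, genes):
    """Return start indices of windows in genes that equal an allowed order."""
    allowed = set(frontiers(tree))
    width = len(next(iter(allowed)))  # every frontier has one entry per leaf
    return [i for i in range(len(genes) - width + 1)
            if tuple(genes[i:i + width]) in allowed]

# Toy cluster: a and d flank an unordered pair {b, c}; the whole may be reversed.
example_tree = ("Q", ["a", ("P", ["b", "c"]), "d"])
print(sorted(frontiers(example_tree)))
print(exact_instances(example_tree, list("xacbdydbca")))
```

Enumerating all orders is exponential in the size of the tree, so this is only useful for seeing which rearrangements a small tree permits.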
About the journal:
Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning.
Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms.
Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.