Constrained incremental tree building: new absolute fast converging phylogeny estimation methods with improved scalability and accuracy
Authors: Qiuyi Zhang, Satish Rao, Tandy Warnow
Journal: Algorithms for Molecular Biology, p. 2 (Journal Article, eCollection)
Published: 2019-02-06
DOI: 10.1186/s13015-019-0136-9 (https://doi.org/10.1186/s13015-019-0136-9)
Impact factor: 1.5; JCR: Q4 (Biochemical Research Methods)
Citations: 12
Abstract
Background: Absolute fast converging (AFC) phylogeny estimation methods are those that have been proven to recover the true tree with high probability given sequences whose lengths are polynomial in the number of leaves in the tree (once the shortest and longest branch weights are fixed). Although there is a large literature on AFC methods, the best in terms of empirical performance was $\mathrm{DCM}_{\mathrm{NJ}}$, published in SODA 2001. The main empirical advantage of $\mathrm{DCM}_{\mathrm{NJ}}$ over other AFC methods is its use of neighbor joining (NJ) to construct trees on smaller taxon subsets, which are then combined into a tree on the full set of species using a supertree method; in contrast, the other AFC methods in essence depend on quartet trees that are computed independently of each other, which reduces accuracy compared to neighbor joining. However, $\mathrm{DCM}_{\mathrm{NJ}}$ is unlikely to scale to large datasets because it relies on supertree methods, and no current supertree method scales to large datasets with high accuracy.
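The AFC property described above can be written more formally. The statement below is one common way to phrase it and is offered as a sketch: the symbols $\Phi$ (the estimation method), $f$ and $g$ (bounds on the branch weights), $\varepsilon$ (the allowed failure probability), $p$ (the polynomial), $k$ (the sequence length), and $\theta$ (the branch weights) are notation introduced here and do not appear in the abstract.

```latex
A method $\Phi$ is absolute fast converging for a given model of sequence
evolution if, for all $0 < f \le g$ and all $\varepsilon > 0$, there is a
polynomial $p$ such that for every model tree $(T, \theta)$ with $n$ leaves
and $f \le \theta(e) \le g$ on every edge $e$,
\[
  k \ge p(n) \;\Longrightarrow\; \Pr\bigl[\Phi(S) = T\bigr] \ge 1 - \varepsilon ,
\]
where $S$ is a set of $n$ sequences of length $k$ generated on $(T, \theta)$.
```

The polynomial $p$ may depend on $f$, $g$, and $\varepsilon$, which is why the abstract adds the qualifier "once the shortest and longest branch weights are fixed".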
Results: In this study we present a new approach to large-scale phylogeny estimation that shares some of the features of $\mathrm{DCM}_{\mathrm{NJ}}$ but bypasses the use of supertree methods. We prove that this new approach is AFC and uses polynomial time and space. Furthermore, we describe variations on this basic approach that can be used with leaf-disjoint constraint trees (computed using methods such as maximum likelihood) to produce other methods that are likely to provide even better accuracy. Thus, we present a new generalizable technique for large-scale tree estimation that is designed to improve the scalability of phylogeny estimation methods to ultra-large datasets, and that can be used in a variety of settings (including tree estimation from unaligned sequences, and species tree estimation from gene trees).
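For readers unfamiliar with neighbor joining, the distance-based method the Background credits for the accuracy of $\mathrm{DCM}_{\mathrm{NJ}}$, the following is a minimal sketch of the textbook NJ algorithm. It is not the method proposed in the paper and does not implement constraint trees or any divide-and-conquer step. The function name `neighbor_joining`, the `_internal` node labels, and the four-taxon example distances are all hypothetical choices made for illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Textbook neighbor joining (illustrative sketch, not the paper's method).

    names: list of at least two leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the unrooted tree as a list of (node, node, branch_length) edges.
    """
    d = dict(dist)          # working copy; distances to new nodes are added
    active = list(names)    # nodes that still have to be joined
    edges = []
    next_id = 0

    while len(active) > 2:
        n = len(active)
        # Sum of distances from each active node to all other active nodes.
        r = {a: sum(d[frozenset((a, b))] for b in active if b != a)
             for a in active}
        # Join the pair minimizing Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        a, b = min(
            combinations(active, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        d_ab = d[frozenset((a, b))]
        # Branch lengths from a and b to the new internal node u.
        len_a = 0.5 * d_ab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        u = f"_internal{next_id}"   # hypothetical label for the new node
        next_id += 1
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distance from u to every remaining node.
        for c in active:
            if c not in (a, b):
                d[frozenset((u, c))] = 0.5 * (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - d_ab
                )
        active = [c for c in active if c not in (a, b)] + [u]

    # Connect the last two nodes directly.
    a, b = active
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Hypothetical additive distances for the tree ((A,B),(C,D)) with leaf
# branches A=1, B=2, C=3, D=4 and an internal branch of length 5.
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 3, frozenset(("A", "C")): 9,
    frozenset(("A", "D")): 10, frozenset(("B", "C")): 10,
    frozenset(("B", "D")): 11, frozenset(("C", "D")): 7,
}
edges = neighbor_joining(names, dist)
```

On additive distances like these, NJ recovers both the topology and the branch lengths of the generating tree. This simple version recomputes the row sums in every round and scans all pairs, so it runs in cubic time in the number of leaves.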
About the journal:
Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning.
Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms.
Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.