Constrained incremental tree building: new absolute fast converging phylogeny estimation methods with improved scalability and accuracy.

IF 1.5 | Zone 4, Biology | Q4 BIOCHEMICAL RESEARCH METHODS
Algorithms for Molecular Biology Pub Date : 2019-02-06 eCollection Date: 2019-01-01 DOI:10.1186/s13015-019-0136-9
Qiuyi Zhang, Satish Rao, Tandy Warnow
Citations: 12

Abstract

Background: Absolute fast converging (AFC) phylogeny estimation methods are ones that have been proven to recover the true tree with high probability given sequences whose lengths are polynomial in the number of leaves in the tree (once the shortest and longest branch weights are fixed). While there is a large literature on AFC methods, the best in terms of empirical performance was $\mathrm{DCM}_{\mathrm{NJ}}$, published in SODA 2001. The main empirical advantage of $\mathrm{DCM}_{\mathrm{NJ}}$ over other AFC methods is its use of neighbor joining (NJ) to construct trees on smaller taxon subsets, which are then combined into a tree on the full set of species using a supertree method; in contrast, the other AFC methods in essence depend on quartet trees that are computed independently of each other, which reduces accuracy compared to neighbor joining. However, $\mathrm{DCM}_{\mathrm{NJ}}$ is unlikely to scale to large datasets due to its reliance on supertree methods, as no current supertree methods are able to scale to large datasets with high accuracy.
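As a rough illustration of the neighbor joining (NJ) step mentioned in the Background, the following Python sketch implements the textbook NJ procedure on a pairwise distance matrix. It is not the authors' code and says nothing about how $\mathrm{DCM}_{\mathrm{NJ}}$ or the new method is implemented. The function name `neighbor_joining`, the dict-of-frozensets distance format, and the internal node labels `n0`, `n1`, ... are assumptions made for this example.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Textbook neighbor joining (illustrative sketch).

    names: list of leaf labels (assumed not to look like "n0", "n1", ...).
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    Returns a list of (node, node, branch_length) edges of an unrooted tree.
    """
    d = dict(dist)          # working copy; grows as internal nodes are added
    active = list(names)    # nodes that still need to be joined
    edges = []
    next_id = 0

    while len(active) > 2:
        n = len(active)
        # r[x] = sum of distances from x to every other active node
        r = {x: sum(d[frozenset((x, y))] for y in active if y != x)
             for x in active}
        # choose the pair minimising Q(a, b) = (n - 2) * d(a, b) - r[a] - r[b]
        a, b = min(
            combinations(active, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dab = d[frozenset((a, b))]

        # create a new internal node u joining a and b
        u = f"n{next_id}"
        next_id += 1
        len_a = 0.5 * dab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = dab - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))

        # distances from u to every remaining node
        for x in active:
            if x not in (a, b):
                d[frozenset((u, x))] = 0.5 * (
                    d[frozenset((a, x))] + d[frozenset((b, x))] - dab
                )
        active = [x for x in active if x not in (a, b)] + [u]

    # connect the last two nodes directly
    a, b = active
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


if __name__ == "__main__":
    # Additive distances for the tree ((A:1, B:2):3, (C:4, D:5))
    pairs = {("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 9,
             ("B", "C"): 9, ("B", "D"): 10, ("C", "D"): 9}
    dist = {frozenset(k): v for k, v in pairs.items()}
    for edge in neighbor_joining(["A", "B", "C", "D"], dist):
        print(edge)
```

This sketch recomputes the row sums in every round, which keeps it short at the cost of speed. On a matrix of true tree (additive) distances, NJ returns the generating tree; on distances estimated from finite sequences it returns an estimate of that tree.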

Results: In this study we present a new approach to large-scale phylogeny estimation that shares some of the features of $\mathrm{DCM}_{\mathrm{NJ}}$ but bypasses the use of supertree methods. We prove that this new approach is AFC and uses polynomial time and space. Furthermore, we describe variations on this basic approach that can be used with leaf-disjoint constraint trees (computed using methods such as maximum likelihood) to produce other methods that are likely to provide even better accuracy. Thus, we present a new generalizable technique for large-scale tree estimation that is designed to improve scalability for phylogeny estimation methods to ultra-large datasets, and that can be used in a variety of settings (including tree estimation from unaligned sequences, and species tree estimation from gene trees).
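The AFC property claimed in the Results can be stated more formally. The block below gives the usual form of the definition as it appears in the AFC literature. It is a paraphrase rather than a quotation from this paper, and the symbols $\Phi$, $\theta$, $f$, $g$, $p$, $k$ and $S$ are notation chosen for this note.

```latex
\textbf{Definition (AFC), stated informally.}
A method $\Phi$ is absolute fast converging for a model of sequence evolution if,
for every $\varepsilon > 0$ and every $0 < f \le g < \infty$, there is a polynomial $p$
such that for every model tree $(T, \theta)$ with $n$ leaves and
$f \le \theta(e) \le g$ on every edge $e$,
\[
  \Pr\bigl[\Phi(S) = T\bigr] \;\ge\; 1 - \varepsilon
  \quad\text{whenever the sequences in } S \text{ (generated on } (T,\theta)\text{) have length } k \ge p(n).
\]
```

Here $f$ and $g$ are the fixed shortest and longest branch weights referred to in the Background, and $p$ may depend on $f$, $g$ and $\varepsilon$ but not on $n$.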


Source journal
Algorithms for Molecular Biology (Biology: Biochemical Research Methods)
CiteScore: 2.40
Self-citation rate: 10.00%
Articles published: 16
Review time: >12 weeks
Journal description: Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.