Non-parametric correction of estimated gene trees using TRACTION.

Impact factor 1.5; Zone 4 (Biology); JCR Q4, Biochemical Research Methods
Algorithms for Molecular Biology Pub Date : 2020-01-04 eCollection Date: 2020-01-01 DOI:10.1186/s13015-019-0161-8
Sarah Christensen, Erin K Molloy, Pranjal Vachaspati, Ananya Yammanuru, Tandy Warnow
Citations: 5

Abstract

Motivation: Estimated gene trees are often inaccurate, due to insufficient phylogenetic signal in the single gene alignment, among other causes. Gene tree correction aims to improve the accuracy of an estimated gene tree by using computational techniques along with auxiliary information, such as a reference species tree or sequencing data. However, gene trees and species trees can differ as a result of gene duplication and loss (GDL), incomplete lineage sorting (ILS), and other biological processes. Thus gene tree correction methods need to take estimation error as well as gene tree heterogeneity into account. Many prior gene tree correction methods have been developed for the case where GDL is present.

Results: Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to ILS and/or horizontal gene transfer (HGT). We introduce TRACTION, a simple polynomial-time method that provably finds an optimal solution to the RF-optimal tree refinement and completion (RF-OTRC) problem, which seeks a refinement and completion of a singly-labeled gene tree with respect to a given singly-labeled species tree so as to minimize the Robinson-Foulds (RF) distance. Our extensive simulation study on 68,000 estimated gene trees shows that TRACTION matches or improves on the accuracy of well-established methods from the GDL literature when HGT and ILS are both present, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. We also show that a naive generalization of the RF-OTRC problem to multi-labeled trees is possible, but can produce misleading results where gene tree heterogeneity is due to GDL.
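
To make the optimization criterion concrete, the following Python sketch computes the RF distance between two trees on the same leaf set as the number of non-trivial bipartitions found in exactly one of them. It then illustrates the refinement idea by adding to an unresolved gene tree every species-tree bipartition that is compatible with it. This is an illustration under stated assumptions, not the TRACTION implementation: the nested-tuple tree encoding and the function names (bipartitions, rf_distance, compatible, refine) are hypothetical, both trees are assumed to share one leaf set, and the completion step for missing species is omitted.

```python
# Illustrative sketch only; names and tree encoding are assumptions.
# A tree is a nested tuple whose leaves are strings, e.g. (("A", "B"), "C", "D").


def leaf_set(tree):
    """Return the set of leaf labels below this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that does NOT contain a fixed
    anchor leaf, so equal bipartitions always compare equal.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def rf_distance(splits_a, splits_b):
    """RF distance: number of bipartitions present in exactly one set."""
    return len(splits_a ^ splits_b)


def compatible(a, b):
    """Two anchor-free sides are compatible if disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def refine(gene_splits, species_splits):
    """Add every species-tree bipartition compatible with the gene tree."""
    extra = {s for s in species_splits
             if all(compatible(s, g) for g in gene_splits)}
    return gene_splits | extra


species = bipartitions(((("A", "B"), "C"), ("D", "E")))
other = bipartitions(((("A", "C"), "B"), ("D", "E")))
unresolved = bipartitions(("A", "B", "C", ("D", "E")))  # polytomy at the root

print(rf_distance(species, other))                        # 2
print(rf_distance(species, unresolved))                   # 1
print(rf_distance(species, refine(unresolved, species)))  # 0
```

In this toy case the unresolved gene tree contains only the bipartition separating D and E from the rest; refinement adds the one remaining species-tree bipartition, which lowers the RF distance to the species tree from 1 to 0.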

Source journal: Algorithms for Molecular Biology (Biology, Biochemical Research Methods)
CiteScore: 2.40
Self-citation rate: 10.00%
Articles published: 16
Review time: >12 weeks
Journal description: Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.