The impact of gene sequence alignment and gene tree estimation error on summary-based species network estimation

Meijun Gao, Wei Wang, Kevin J. Liu
Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2022-08-07. DOI: 10.1145/3535508.3545559 (https://doi.org/10.1145/3535508.3545559)

Abstract

Thanks in part to rapid advances in next-generation sequencing technologies, recent phylogenomic studies have demonstrated the pivotal role that non-tree-like evolution plays in many parts of the Tree of Life - the evolutionary history of all life on Earth. As such, the Tree of Life is not necessarily a tree at all, but is better described by more general graph structures such as a phylogenetic network. Another key ingredient in these advances consists of the computational methods needed for reconstructing phylogenetic networks from large-scale genomic sequence data. But virtually all of these methods either require multiple sequence alignments (MSAs) as input or utilize gene trees or other inputs that are computed using MSAs. All of the input MSAs and gene trees must be estimated on empirical data. The methods themselves do not directly account for upstream estimation error, and, apart from prior studies of phylogenetic tree reconstruction and anecdotal evidence, little is understood about the impact of estimated MSA and gene tree error on downstream species network reconstruction. We therefore undertake a performance study to quantify the impact of MSA error and gene tree error on state-of-the-art phylogenetic network inference methods. Our study utilizes synthetic benchmarking data as well as genomic sequence data from mosquito and yeast. We find that upstream MSA and gene tree estimation error can have first-order effects on the accuracy of downstream network reconstruction and, to a lesser extent, its computational runtime. The effects become more pronounced on more challenging datasets with greater evolutionary divergence and more sampled taxa. Our findings highlight an important need for computational methods development: namely, scalable methods are needed to account for estimated MSA and gene tree error when reconstructing phylogenetic networks using unaligned biomolecular sequence data.