Efficient phylogenetic tree inference for massive taxonomic datasets: harnessing the power of a server to analyze 1 million taxa.

IF 11.8 2区 生物学 Q1 MULTIDISCIPLINARY SCIENCES
César Piñeiro, Juan C Pichel
{"title":"Efficient phylogenetic tree inference for massive taxonomic datasets: harnessing the power of a server to analyze 1 million taxa.","authors":"César Piñeiro, Juan C Pichel","doi":"10.1093/gigascience/giae055","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Phylogenies play a crucial role in biological research. Unfortunately, the search for the optimal phylogenetic tree incurs significant computational costs, and most of the existing state-of-the-art tools cannot deal with extremely large datasets in reasonable times.</p><p><strong>Results: </strong>In this work, we introduce the new VeryFastTree code (version 4.0), which is able to construct a tree on 1 server using single-precision arithmetic from a massive 1 million alignment dataset in only 36 hours, which is 3 times and 3.2 times faster than its previous version and FastTree-2, respectively. This new version further boosts performance by parallelizing all tree traversal operations during the tree construction process, including subtree pruning and regrafting moves. Additionally, it introduces significant new features such as support for new and compressed file formats, enhanced compatibility across a broader range of operating systems, and the integration of disk computing functionality. The latter feature is particularly advantageous for users without access to high-end servers, as it allows them to manage very large datasets, albeit with an increase in computing time.</p><p><strong>Conclusions: </strong>Experimental results establish VeryFastTree as the fastest tool in the state-of-the-art for maximum likelihood phylogeny estimation. It is publicly available at https://github.com/citiususc/veryfasttree. In addition, VeryFastTree is included as a package in Bioconda, MacPorts, and all Debian-based Linux distributions.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8000,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11308190/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"GigaScience","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/gigascience/giae055","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Phylogenies play a crucial role in biological research. Unfortunately, the search for the optimal phylogenetic tree incurs significant computational costs, and most of the existing state-of-the-art tools cannot deal with extremely large datasets in reasonable times.

Results: In this work, we introduce the new VeryFastTree code (version 4.0), which is able to construct a tree on 1 server using single-precision arithmetic from a massive 1 million alignment dataset in only 36 hours, which is 3 times and 3.2 times faster than its previous version and FastTree-2, respectively. This new version further boosts performance by parallelizing all tree traversal operations during the tree construction process, including subtree pruning and regrafting moves. Additionally, it introduces significant new features such as support for new and compressed file formats, enhanced compatibility across a broader range of operating systems, and the integration of disk computing functionality. The latter feature is particularly advantageous for users without access to high-end servers, as it allows them to manage very large datasets, albeit with an increase in computing time.

Conclusions: Experimental results establish VeryFastTree as the fastest tool in the state-of-the-art for maximum likelihood phylogeny estimation. It is publicly available at https://github.com/citiususc/veryfasttree. In addition, VeryFastTree is included as a package in Bioconda, MacPorts, and all Debian-based Linux distributions.

海量分类数据集的高效系统发生树推断:利用服务器的力量分析 100 万个分类群。
背景:系统发生在生物学研究中起着至关重要的作用。遗憾的是,寻找最优的系统发生树需要大量的计算成本,而现有的大多数先进工具都无法在合理的时间内处理超大数据集:在这项工作中,我们介绍了新的 VeryFastTree 代码(4.0 版),它能够在 1 台服务器上使用单精度算术构建一棵树,而处理 100 万个庞大的比对数据集仅需 36 个小时,这分别是其旧版本和 FastTree-2 的 3 倍和 3.2 倍。新版本通过并行化树构建过程中的所有树遍历操作,包括子树修剪和重新嫁接移动,进一步提高了性能。此外,它还引入了一些重要的新功能,如支持新的压缩文件格式,增强了对更多操作系统的兼容性,并集成了磁盘计算功能。后一项功能对于无法使用高端服务器的用户来说尤其有利,因为它允许用户管理超大型数据集,尽管会增加计算时间:实验结果表明,VeryFastTree 是最先进的最大似然系统发育估计工具中速度最快的。该工具已在 https://github.com/citiususc/veryfasttree 上公开发布。此外,VeryFastTree 作为一个软件包包含在 Bioconda、MacPorts 和所有基于 Debian 的 Linux 发行版中。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
GigaScience
GigaScience MULTIDISCIPLINARY SCIENCES-
CiteScore
15.50
自引率
1.10%
发文量
119
审稿时长
1 weeks
期刊介绍: GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信