Efficient phylogenetic tree inference for massive taxonomic datasets: harnessing the power of a server to analyze 1 million taxa.

IF 11.8 | Zone 2, Biology | Q1 | MULTIDISCIPLINARY SCIENCES

GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae055

César Piñeiro, Juan C Pichel


Citations: 0

Abstract

Efficient phylogenetic tree inference for massive taxonomic datasets: harnessing the power of a server to analyze 1 million taxa.

Background: Phylogenies play a crucial role in biological research. Unfortunately, the search for the optimal phylogenetic tree incurs significant computational costs, and most of the existing state-of-the-art tools cannot deal with extremely large datasets in reasonable times.

Results: In this work, we introduce the new VeryFastTree code (version 4.0), which is able to construct a tree on 1 server using single-precision arithmetic from a massive 1 million alignment dataset in only 36 hours, which is 3 times and 3.2 times faster than its previous version and FastTree-2, respectively. This new version further boosts performance by parallelizing all tree traversal operations during the tree construction process, including subtree pruning and regrafting moves. Additionally, it introduces significant new features such as support for new and compressed file formats, enhanced compatibility across a broader range of operating systems, and the integration of disk computing functionality. The latter feature is particularly advantageous for users without access to high-end servers, as it allows them to manage very large datasets, albeit with an increase in computing time.
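The speedup figures above can be turned into rough absolute runtimes. The sketch below is only back-of-the-envelope arithmetic, not a result reported by the authors: it assumes that "N times faster" means the older tool's runtime is N times the 36 hours quoted for VeryFastTree 4.0 on the same dataset and server. The function and constant names are hypothetical and chosen for illustration.

```python
# Reported in the abstract: VeryFastTree 4.0 builds the tree in about 36 hours.
REPORTED_HOURS = 36.0

# Speedup factors quoted in the abstract.
# Assumption: "N times faster" is read as a runtime ratio of N.
SPEEDUPS = {
    "VeryFastTree (previous version)": 3.0,
    "FastTree-2": 3.2,
}


def implied_runtime_hours(new_hours: float, speedup: float) -> float:
    """Runtime of the slower tool implied by a runtime-ratio speedup."""
    return new_hours * speedup


def implied_runtimes(new_hours: float = REPORTED_HOURS,
                     speedups: dict = SPEEDUPS) -> dict:
    """Map each baseline tool to its implied runtime in hours."""
    return {tool: implied_runtime_hours(new_hours, factor)
            for tool, factor in speedups.items()}


if __name__ == "__main__":
    for tool, hours in implied_runtimes().items():
        print(f"{tool}: ~{hours:.1f} h (~{hours / 24:.1f} days)")
```

Under that reading, the baselines would need roughly 108 and 115 hours, that is, between four and five days instead of a day and a half.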

Conclusions: Experimental results establish VeryFastTree as the fastest tool in the state-of-the-art for maximum likelihood phylogeny estimation. It is publicly available at https://github.com/citiususc/veryfasttree. In addition, VeryFastTree is included as a package in Bioconda, MacPorts, and all Debian-based Linux distributions.

Source journal

GigaScience (MULTIDISCIPLINARY SCIENCES)

CiteScore: 15.50
Self-citation rate: 1.10%
Articles published: 119
Review time: 1 week

About the journal: GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.