Accelerating Maximum Likelihood Phylogenetic Inference via Early Stopping to Evade (Over-)optimization

IF 5.7 | Zone 1, Biology | JCR Q1, Evolutionary Biology

Systematic Biology. Pub Date: 2025-05-30. DOI: 10.1093/sysbio/syaf043

Anastasis Togkousidis, Alexandros Stamatakis, Olivier Gascuel


Citations: 0

Abstract


Accelerating Maximum Likelihood Phylogenetic Inference via Early Stopping to Evade (Over-)optimization

Maximum Likelihood (ML) based phylogenetic inference constitutes a challenging optimization problem. Given a set of aligned input sequences, phylogenetic inference tools strive to determine the tree topology, the branch lengths, and the evolutionary model parameters that maximize the phylogenetic likelihood function. However, there are compelling reasons not to push optimization to its limits and instead to apply early, yet adequate, stopping criteria. Since input sequences are typically subject to stochastic and systematic noise, caution is warranted to prevent over-optimization and the risk of overfitting the model to noisy data. To address this, we integrate the Kishino-Hasegawa (KH) test into RAxML-NG as a reliable and fast-to-compute Early Stopping criterion to effectively limit excessive and compute-intensive over-optimization. Initially, we introduce a simplified heuristic tree search strategy in RAxML-NG (sRAxML-NG) as an underlying method for Early Stopping. Subsequently, we use the KH test in combination with sRAxML-NG to statistically assess the significance of differences between intermediate trees prior to and after major optimization steps. The tree search terminates early when improvements are statistically insignificant. We also propose an extension to the standard KH test that allows correcting for multiple testing, which maintains accuracy while achieving even higher speedups. For benchmarking, we use 300 large representative empirical datasets from TreeBASE. For 98% of the DNA datasets, all Early Stopping methods we introduce infer trees that are statistically equivalent to those inferred from RAxML-NG v1.2. For AA datasets, the fraction of datasets where sRAxML-NG, KH, and the KH-multiple testing versions infer statistically equivalent trees is 96%, 95%, and 92%, respectively. In conjunction with sRAxML-NG, the average speedup achieved by the KH-multiple testing version is 5x for DNA and 3.9x for protein datasets compared to RAxML-NG v1.2. We implemented our stopping criteria in RAxML-NG, which is available under GNU GPL at https://github.com/togkousa/raxml-ng/tree/stopping-criteria.
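The stopping rule described in the abstract can be illustrated with a small sketch. It is not the RAxML-NG implementation: it assumes that per-site log-likelihoods are available for each intermediate tree, uses a one-sided normal approximation of the KH test rather than whatever variant the authors chose, and the function names `kh_test_p_value` and `search_with_early_stopping` are hypothetical.

```python
import math
from statistics import NormalDist


def kh_test_p_value(site_lnl_old, site_lnl_new):
    """One-sided KH-style test using a normal approximation (assumption).

    Inputs are per-site log-likelihoods of the same alignment under the
    tree before and after an optimization step. Returns the p-value for
    the null hypothesis that the new tree is not better than the old one.
    """
    if len(site_lnl_old) != len(site_lnl_new):
        raise ValueError("both trees must be scored on the same sites")
    n = len(site_lnl_old)
    if n < 2:
        raise ValueError("need at least two sites")
    diffs = [new - old for old, new in zip(site_lnl_old, site_lnl_new)]
    total = sum(diffs)
    mean = total / n
    # Variance of the summed difference, estimated from per-site differences.
    var_total = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if var_total == 0.0:
        # Every site changed by the same amount: no sampling uncertainty.
        return 0.0 if total > 0 else 1.0
    z = total / math.sqrt(var_total)
    return 1.0 - NormalDist().cdf(z)


def search_with_early_stopping(step_site_lnls, alpha=0.05):
    """Return the index of the last intermediate tree that was evaluated.

    step_site_lnls holds one per-site log-likelihood list per intermediate
    tree, in search order. The search stops at the first step whose
    improvement over the previous tree is not significant at level alpha.
    """
    for i in range(1, len(step_site_lnls)):
        p = kh_test_p_value(step_site_lnls[i - 1], step_site_lnls[i])
        if p >= alpha:
            return i  # improvement is statistically insignificant: stop
    return len(step_site_lnls) - 1
```

Lowering `alpha` makes it harder for an improvement to count as significant, so the search stops earlier. A generic multiple-testing correction such as Bonferroni would divide `alpha` by the number of tests; whether the extension proposed in the paper works this way is not stated in the abstract.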


Source journal

Systematic Biology (Biology: Evolutionary Biology)

CiteScore

13.00

Self-citation rate

7.70%

Publication volume

Review time

6-12 weeks

About the journal: Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.