Accelerating Maximum Likelihood Phylogenetic Inference via Early Stopping to Evade (Over-)optimization
Authors: Anastasis Togkousidis, Alexandros Stamatakis, Olivier Gascuel
Journal: Systematic Biology (Journal Article, published 2025-05-30)
DOI: 10.1093/sysbio/syaf043 (https://doi.org/10.1093/sysbio/syaf043)
Maximum Likelihood (ML) based phylogenetic inference constitutes a challenging optimization problem. Given a set of aligned input sequences, phylogenetic inference tools strive to determine the tree topology, the branch lengths, and the evolutionary model parameters that maximize the phylogenetic likelihood function. However, there are compelling reasons not to push optimization to its limits and to rely instead on early, yet adequate, stopping criteria. Since input sequences are typically subject to stochastic and systematic noise, caution is warranted to prevent over-optimization and the risk of overfitting the model to noisy data. To address this, we integrate the Kishino-Hasegawa (KH) test into RAxML-NG as a reliable and fast-to-compute Early Stopping criterion that limits excessive, compute-intensive over-optimization. First, we introduce a simplified heuristic tree search strategy in RAxML-NG (sRAxML-NG) as the underlying method for Early Stopping. We then use the KH test in combination with sRAxML-NG to statistically assess the significance of differences between intermediate trees before and after major optimization steps. The tree search terminates early when improvements are statistically insignificant. We also propose an extension to the standard KH test that corrects for multiple testing, which maintains accuracy while achieving even higher speedups. For benchmarking, we use 300 large, representative empirical datasets from TreeBASE. For 98% of the DNA datasets, all Early Stopping methods we introduce infer trees that are statistically equivalent to those inferred by RAxML-NG v1.2. For amino acid (AA) datasets, the fractions of datasets for which sRAxML-NG, the KH version, and the KH-multiple testing version infer statistically equivalent trees are 96%, 95%, and 92%, respectively. In conjunction with sRAxML-NG, the average speedup achieved by the KH-multiple testing version over RAxML-NG v1.2 is 5x for DNA and 3.9x for protein datasets. We implemented our stopping criteria in RAxML-NG, which is available under GNU GPL at https://github.com/togkousa/raxml-ng/tree/stopping-criteria.
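The stopping rule described in the abstract can be illustrated with a minimal Python sketch. It is not the RAxML-NG implementation. It assumes that per-site log-likelihoods are available for the tree before and after an optimization step, uses the common normal-approximation form of the KH test, and applies the test one-sided. The function names, the `alpha` default, and the `num_tests` parameter are hypothetical. The Bonferroni-style division of the significance level is only a stand-in, since the abstract does not specify how the paper corrects for multiple testing.

```python
import math
from typing import Sequence


def kh_test_p_value(site_lnl_before: Sequence[float],
                    site_lnl_after: Sequence[float]) -> float:
    """One-sided KH-style p-value (normal approximation) for the gain
    in total log-likelihood from the 'before' tree to the 'after' tree."""
    n = len(site_lnl_before)
    if n != len(site_lnl_after) or n < 2:
        raise ValueError("need two equally long vectors with at least 2 sites")

    # Per-site log-likelihood differences and their sum (the test statistic).
    diffs = [a - b for a, b in zip(site_lnl_after, site_lnl_before)]
    delta = sum(diffs)
    mean = delta / n

    # Estimated variance of the summed difference across sites.
    var_sum = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if var_sum == 0.0:
        # All per-site differences are identical: no sampling uncertainty.
        return 0.0 if delta > 0 else 1.0

    z = delta / math.sqrt(var_sum)
    # Upper-tail probability of a standard normal variable.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def should_stop(site_lnl_before: Sequence[float],
                site_lnl_after: Sequence[float],
                alpha: float = 0.05,
                num_tests: int = 1) -> bool:
    """Stop when the improvement is not statistically significant.

    Assumption: 'num_tests' applies a Bonferroni-style correction
    (alpha / num_tests) as a placeholder for the paper's extension."""
    p_value = kh_test_p_value(site_lnl_before, site_lnl_after)
    return p_value >= alpha / num_tests
```

In this sketch, a larger `num_tests` lowers the per-test significance level, so improvements are declared significant less often and the search stops sooner. That matches the direction of the higher speedups the abstract reports for the multiple-testing version, although the actual correction may differ.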
About the journal:
Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.