Expectation-Maximization enables Phylogenetic Dating under a Categorical Rate Model
Uyen Mai, Eduardo Charvel, Siavash Mirarab
Systematic Biology, pp. 823-838. Published 2024-10-30.
DOI: 10.1093/sysbio/syae034 (https://doi.org/10.1093/sysbio/syae034)
PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11524793/pdf/
Abstract
Dating phylogenetic trees to obtain branch lengths in units of time is essential for many downstream applications but remains challenging. Dating requires inferring substitution rates that can change across the tree. Although we can assume that information about a small subset of nodes is available from the fossil record or sampling times (for fast-evolving organisms), inferring the ages of the remaining nodes essentially requires extrapolation and interpolation. Assuming a distribution of branch rates, we can formulate dating as a constrained maximum likelihood (ML) estimation problem. ML dating methods exist, but their accuracy degrades under model misspecification, where the assumed parametric distribution of branch rates differs vastly from the true distribution. Notably, most existing methods assume rigid, often unimodal, branch rate distributions. A second challenge is that the likelihood function involves an integral over the continuous domain of the rates, often leading to difficult non-convex optimization problems. To tackle both challenges, we propose a new method called Molecular Dating using Categorical-models (MD-Cat). MD-Cat uses a categorical model of rates inspired by non-parametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories. Under this model, we can use the Expectation-Maximization (EM) algorithm to co-estimate rate categories and branch lengths in time units. Our model makes fewer assumptions about the true distribution of branch rates than parametric models such as the Gamma or LogNormal distributions. Our results on two simulated and real datasets (Angiosperms and HIV) and across a wide selection of rate distributions show that MD-Cat is often more accurate than the alternatives, especially on datasets with exponential or multimodal rate distributions.
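To make the idea of EM over k discrete rate categories concrete, the following is a minimal toy sketch and not the MD-Cat implementation. It rests on several assumptions that the abstract does not state: substitution counts per branch are treated as Poisson, the k categories are given equal weight, and branch durations are held fixed, so only the rate categories are estimated (the abstract describes co-estimating branch lengths in time units as well). The function name `em_rate_categories` and all parameter names are hypothetical.

```python
import math


def em_rate_categories(counts, times, seq_len, k, n_iter=200):
    """Toy EM: fit k equally weighted rate categories to per-branch counts.

    counts[i]: substitutions observed on branch i (assumed Poisson here)
    times[i]:  elapsed time on branch i (held fixed; an assumption of this sketch)
    seq_len:   number of sites, so the Poisson mean is seq_len * rate * time
    """
    # Initialise the k rates at evenly spaced quantiles of the naive per-branch rates.
    naive = sorted(c / (seq_len * t) for c, t in zip(counts, times))
    rates = [max(naive[int((j + 0.5) * len(naive) / k)], 1e-12) for j in range(k)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each branch belongs to each category.
        # The equal category weights 1/k and the log(n!) term cancel on normalising.
        resp = []
        for n, t in zip(counts, times):
            logp = [n * math.log(seq_len * w * t) - seq_len * w * t for w in rates]
            top = max(logp)  # subtract the max for numerical stability
            weights = [math.exp(lp - top) for lp in logp]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: closed-form Poisson update, a responsibility-weighted rate.
        for j in range(k):
            num = sum(q[j] * n for q, n in zip(resp, counts))
            den = seq_len * sum(q[j] * t for q, t in zip(resp, times))
            if den > 0:
                rates[j] = max(num / den, 1e-12)
    return rates, resp


# Hypothetical demo data: 10 slow branches and 10 fast branches, noise-free counts.
times = [1.0 + (i % 10) for i in range(20)]
true_rates = [0.001] * 10 + [0.01] * 10
counts = [round(1000 * w * t) for w, t in zip(true_rates, times)]
rates, resp = em_rate_categories(counts, times, seq_len=1000, k=2)
```

Because the rates form a discrete set, the E-step is a finite sum over categories rather than an integral over a continuous rate domain, which is the property the abstract highlights. A fuller treatment would also update the branch durations in the M-step subject to calibration constraints.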
About the journal:
Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.