Sequential Bayesian Phylogenetic Inference.

IF 6.1 | Zone 1, Biology | JCR Q1, Evolutionary Biology

Systematic Biology Pub Date : 2024-10-25 DOI:10.1093/sysbio/syae020

Sebastian Höhna, Allison Y Hsiang

Pages: 704-721

Citations: 0

Abstract

The ideal approach to Bayesian phylogenetic inference is to estimate all parameters of interest jointly in a single hierarchical model. However, this is often infeasible in practice because of the high computational cost. Instead, phylogenetic pipelines generally consist of sequential analyses, in which a single point estimate from one analysis is used as input for the next (e.g., a single multiple sequence alignment is used to estimate a gene tree). In this framework, uncertainty is not propagated from step to step, which can lead to inaccurate or spuriously confident results. Here, we formally develop and test a sequential inference approach for Bayesian phylogenetic inference, which uses importance sampling to generate observations for the next step of an analysis pipeline from the posterior distribution produced in the previous step. This approach not only accounts for uncertainty between analysis steps but also allows greater flexibility in software choice (and hence model availability), and it can be computationally more efficient than the traditional joint inference approach when multiple models are being tested. We show that the sequential inference approach is identical in practice to the joint inference approach only if the data contain sufficient information (yielding a narrow posterior distribution) and/or sufficiently many importance samples are used. Conversely, we show that the common practice of using a single point estimate can be biased, for example, when a single phylogeny estimate is used to transform an unrooted phylogeny into a time-calibrated phylogeny. We demonstrate the theory of sequential Bayesian inference using both a toy example and an empirical case study of divergence-time estimation in insects using a relaxed clock model from transcriptome data. In the empirical example, we estimate 3 posterior distributions of branch lengths from the same data (a DNA character matrix with a GTR+Γ+I substitution model, an amino acid data matrix with empirical substitution models, and an amino acid data matrix with the PhyloBayes CAT-GTR model). Finally, we apply 3 different node-calibration strategies and show that divergence time estimates are affected by the data source and the underlying substitution process used to estimate branch lengths, as well as by the node-calibration strategy. Thus, our new sequential Bayesian phylogenetic inference provides the opportunity to efficiently test different approaches for divergence time estimation, including branch-length estimation from other software.
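
The idea of passing posterior samples, rather than a point estimate, to the next step can be illustrated with a minimal sketch. It is not the authors' implementation or their toy example: the three-level normal model, all numeric values, and names such as `normal_pdf`, `v0`, and `mu_grid` are assumptions chosen so that the joint posterior is known in closed form. Step 1 samples an intermediate parameter `theta` under a wide step-1 prior; step 2 approximates the marginal likelihood of the downstream parameter `mu` by averaging the importance ratio p(theta_i | mu) / p0(theta_i) over those samples.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Hypothetical toy model (every value below is an illustrative assumption):
#   mu ~ Normal(0, s^2); theta | mu ~ Normal(mu, tau^2); y_j | theta ~ Normal(theta, sigma^2)
n, ybar, sigma = 5, 1.2, 1.0   # sample size, sample mean, known observation sd
tau, s = 0.5, 2.0              # sd of theta given mu, prior sd of mu
v0 = 10.0                      # sd of the wide step-1 prior p0(theta) = Normal(0, v0^2)

rng = np.random.default_rng(1)

# Step 1: posterior of theta under p0 (conjugate update), then draw samples.
post1_var = 1.0 / (1.0 / v0**2 + n / sigma**2)
post1_mean = post1_var * (n * ybar / sigma**2)
theta = rng.normal(post1_mean, np.sqrt(post1_var), size=5000)

# Step 2: importance-sampling estimate of the marginal likelihood of mu,
#   L(mu) ~ mean_i[ p(theta_i | mu) / p0(theta_i) ],
# evaluated on a grid and combined with the prior on mu.
mu_grid = np.linspace(-8.0, 8.0, 801)
ratio = normal_pdf(theta[None, :], mu_grid[:, None], tau) / normal_pdf(theta, 0.0, v0)[None, :]
post = normal_pdf(mu_grid, 0.0, s) * ratio.mean(axis=1)
post /= post.sum()
seq_mean = float((mu_grid * post).sum())
seq_sd = float(np.sqrt(((mu_grid - seq_mean) ** 2 * post).sum()))

# Reference: exact joint posterior of mu (theta integrated out analytically).
marg_var = tau**2 + sigma**2 / n
joint_var = 1.0 / (1.0 / s**2 + 1.0 / marg_var)
joint_mean = joint_var * ybar / marg_var
joint_sd = float(np.sqrt(joint_var))

# Contrast: plug in the step-1 posterior mean as if theta had been observed exactly.
point_var = 1.0 / (1.0 / s**2 + 1.0 / tau**2)
point_mean = point_var * post1_mean / tau**2
point_sd = float(np.sqrt(point_var))

print(f"joint      mean={joint_mean:.3f} sd={joint_sd:.3f}")
print(f"sequential mean={seq_mean:.3f} sd={seq_sd:.3f}")
print(f"point est. mean={point_mean:.3f} sd={point_sd:.3f}")
```

In this sketch the sequential estimate should track the exact joint posterior closely, while the plug-in analysis has a smaller posterior standard deviation because it ignores the uncertainty in `theta`. With fewer samples or a more diffuse step-1 posterior, the importance-sampling approximation would be expected to degrade.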




Source journal

Systematic Biology (Biology: Evolutionary Biology)

CiteScore: 13.00

Self-citation rate: 7.70%

Articles published:

Review time: 6-12 weeks

About the journal: Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.