{"title":"外源序列重建:通过外源序列交叉验证评估祖先序列重建的准确性","authors":"Michael A. Sennett, Douglas L. Theobald","doi":"10.1007/s00239-024-10162-3","DOIUrl":null,"url":null,"abstract":"<p>Ancestral sequence reconstruction (ASR) is a phylogenetic method widely used to analyze the properties of ancient biomolecules and to elucidate mechanisms of molecular evolution. Despite its increasingly widespread application, the accuracy of ASR is currently unknown, as it is generally impossible to compare resurrected proteins to the true ancestors. Which evolutionary models are best for ASR? How accurate are the resulting inferences? Here we answer these questions using a cross-validation method to reconstruct each extant sequence in an alignment with ASR methodology, a method we term “extant sequence reconstruction” (ESR). We thus can evaluate the accuracy of ASR methodology by comparing ESR reconstructions to the corresponding known true sequences. We find that a common measure of the quality of a reconstructed sequence, the average probability, is indeed a good estimate of the fraction of correct amino acids when the evolutionary model is accurate or overparameterized. However, the average probability is a poor measure for comparing reconstructions from different models, because, surprisingly, a more accurate phylogenetic model often results in reconstructions with lower probability. While better (more predictive) models may produce reconstructions with lower sequence identity to the true sequences, better models nevertheless produce reconstructions that are more biophysically similar to true ancestors. In addition, we find that a large fraction of sequences sampled from the reconstruction distribution may have fewer errors than the single most probable (SMP) sequence reconstruction, despite the fact that the SMP has the lowest expected error of all possible sequences. Our results emphasize the importance of model selection for ASR and the usefulness of sampling sequence reconstructions for analyzing ancestral protein properties. ESR is a powerful method for validating the evolutionary models used for ASR and can be applied in practice to any phylogenetic analysis of real biological sequences. Most significantly, ESR uses ASR methodology to provide a general method by which the biophysical properties of resurrected proteins can be compared to the properties of the true protein.</p>","PeriodicalId":16366,"journal":{"name":"Journal of Molecular Evolution","volume":"23 1","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Extant Sequence Reconstruction: The Accuracy of Ancestral Sequence Reconstructions Evaluated by Extant Sequence Cross-Validation\",\"authors\":\"Michael A. Sennett, Douglas L. Theobald\",\"doi\":\"10.1007/s00239-024-10162-3\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Ancestral sequence reconstruction (ASR) is a phylogenetic method widely used to analyze the properties of ancient biomolecules and to elucidate mechanisms of molecular evolution. Despite its increasingly widespread application, the accuracy of ASR is currently unknown, as it is generally impossible to compare resurrected proteins to the true ancestors. Which evolutionary models are best for ASR? How accurate are the resulting inferences? Here we answer these questions using a cross-validation method to reconstruct each extant sequence in an alignment with ASR methodology, a method we term “extant sequence reconstruction” (ESR). We thus can evaluate the accuracy of ASR methodology by comparing ESR reconstructions to the corresponding known true sequences. We find that a common measure of the quality of a reconstructed sequence, the average probability, is indeed a good estimate of the fraction of correct amino acids when the evolutionary model is accurate or overparameterized. However, the average probability is a poor measure for comparing reconstructions from different models, because, surprisingly, a more accurate phylogenetic model often results in reconstructions with lower probability. While better (more predictive) models may produce reconstructions with lower sequence identity to the true sequences, better models nevertheless produce reconstructions that are more biophysically similar to true ancestors. In addition, we find that a large fraction of sequences sampled from the reconstruction distribution may have fewer errors than the single most probable (SMP) sequence reconstruction, despite the fact that the SMP has the lowest expected error of all possible sequences. Our results emphasize the importance of model selection for ASR and the usefulness of sampling sequence reconstructions for analyzing ancestral protein properties. ESR is a powerful method for validating the evolutionary models used for ASR and can be applied in practice to any phylogenetic analysis of real biological sequences. Most significantly, ESR uses ASR methodology to provide a general method by which the biophysical properties of resurrected proteins can be compared to the properties of the true protein.</p>\",\"PeriodicalId\":16366,\"journal\":{\"name\":\"Journal of Molecular Evolution\",\"volume\":\"23 1\",\"pages\":\"\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2024-03-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Molecular Evolution\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1007/s00239-024-10162-3\",\"RegionNum\":3,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"BIOCHEMISTRY & MOLECULAR BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Molecular Evolution","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1007/s00239-024-10162-3","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"BIOCHEMISTRY & MOLECULAR BIOLOGY","Score":null,"Total":0}
引用次数: 0
摘要
祖先序列重建(ASR)是一种系统发生学方法,广泛用于分析古代生物大分子的特性和阐明分子进化机制。尽管ASR的应用越来越广泛,但其准确性目前尚不清楚,因为通常无法将复活的蛋白质与真正的祖先进行比较。哪种进化模型最适合 ASR?由此得出的推论准确度如何?在这里,我们用交叉验证的方法来回答这些问题,用 ASR 方法重建排列中的每个现存序列,我们称这种方法为 "现存序列重建"(ESR)。因此,我们可以通过比较 ESR 重建与相应的已知真实序列来评估 ASR 方法的准确性。我们发现,在进化模型准确或过度参数化的情况下,衡量重建序列质量的常用指标--平均概率,确实是对正确氨基酸比例的良好估计。然而,在比较不同模型的重建结果时,平均概率并不是一个好的衡量标准,因为令人惊讶的是,更准确的系统进化模型往往会导致重建结果的概率更低。虽然更好(更具预测性)的模型可能会产生与真实序列具有较低序列同一性的重建结果,但更好的模型所产生的重建结果在生物物理上与真实祖先更为相似。此外,我们还发现,从重建分布中采样的大部分序列可能比单一最可能(SMP)序列重建的误差更小,尽管事实上 SMP 在所有可能序列中具有最低的预期误差。我们的研究结果强调了模型选择对 ASR 的重要性,以及取样序列重建对分析祖先蛋白质特性的有用性。ESR 是验证 ASR 所用进化模型的有力方法,可实际应用于任何真实生物序列的系统发育分析。最重要的是,ESR 利用 ASR 方法提供了一种通用方法,可将复活蛋白质的生物物理特性与真实蛋白质的特性进行比较。
Extant Sequence Reconstruction: The Accuracy of Ancestral Sequence Reconstructions Evaluated by Extant Sequence Cross-Validation
Ancestral sequence reconstruction (ASR) is a phylogenetic method widely used to analyze the properties of ancient biomolecules and to elucidate mechanisms of molecular evolution. Despite its increasingly widespread application, the accuracy of ASR is currently unknown, as it is generally impossible to compare resurrected proteins to the true ancestors. Which evolutionary models are best for ASR? How accurate are the resulting inferences? Here we answer these questions using a cross-validation method to reconstruct each extant sequence in an alignment with ASR methodology, a method we term “extant sequence reconstruction” (ESR). We thus can evaluate the accuracy of ASR methodology by comparing ESR reconstructions to the corresponding known true sequences. We find that a common measure of the quality of a reconstructed sequence, the average probability, is indeed a good estimate of the fraction of correct amino acids when the evolutionary model is accurate or overparameterized. However, the average probability is a poor measure for comparing reconstructions from different models, because, surprisingly, a more accurate phylogenetic model often results in reconstructions with lower probability. While better (more predictive) models may produce reconstructions with lower sequence identity to the true sequences, better models nevertheless produce reconstructions that are more biophysically similar to true ancestors. In addition, we find that a large fraction of sequences sampled from the reconstruction distribution may have fewer errors than the single most probable (SMP) sequence reconstruction, despite the fact that the SMP has the lowest expected error of all possible sequences. Our results emphasize the importance of model selection for ASR and the usefulness of sampling sequence reconstructions for analyzing ancestral protein properties. ESR is a powerful method for validating the evolutionary models used for ASR and can be applied in practice to any phylogenetic analysis of real biological sequences. Most significantly, ESR uses ASR methodology to provide a general method by which the biophysical properties of resurrected proteins can be compared to the properties of the true protein.
期刊介绍:
Journal of Molecular Evolution covers experimental, computational, and theoretical work aimed at deciphering features of molecular evolution and the processes bearing on these features, from the initial formation of macromolecular systems through their evolution at the molecular level, the co-evolution of their functions in cellular and organismal systems, and their influence on organismal adaptation, speciation, and ecology. Topics addressed include the evolution of informational macromolecules and their relation to more complex levels of biological organization, including populations and taxa, as well as the molecular basis for the evolution of ecological interactions of species and the use of molecular data to infer fundamental processes in evolutionary ecology. This coverage accommodates such subfields as new genome sequences, comparative structural and functional genomics, population genetics, the molecular evolution of development, the evolution of gene regulation and gene interaction networks, and in vitro evolution of DNA and RNA, molecular evolutionary ecology, and the development of methods and theory that enable molecular evolutionary inference, including but not limited to, phylogenetic methods.