Impact of phylogeny on the inference of functional sectors from protein sequence data.

IF 3.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

PLoS Computational Biology Pub Date : 2024-09-23 eCollection Date: 2024-09-01 DOI:10.1371/journal.pcbi.1012091

Nicola Dietler, Alia Abbara, Subham Choudhury, Anne-Florence Bitbol

PLoS Computational Biology 20(9): e1012091. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11449291/pdf/

Citations: 0

Abstract

Statistical analysis of multiple sequence alignments of homologous proteins has revealed groups of coevolving amino acids called sectors. These groups of amino-acid sites feature collective correlations in their amino-acid usage, and they are associated with functional properties. Modeling showed that nonlinear selection on an additive functional trait of a protein is generically expected to give rise to a functional sector. These modeling results motivated a principled method, called ICOD, which is designed to identify functional sectors, as well as mutational effects, from sequence data. However, a challenge for all methods aiming to identify sectors from multiple sequence alignments is that correlations in amino-acid usage can also arise from the mere fact that homologous sequences share common ancestry, i.e., from phylogeny. Here, we generate controlled synthetic data from a minimal model comprising both phylogeny and functional sectors. We use this data to dissect the impact of phylogeny on sector identification and on mutational effect inference by different methods. We find that ICOD is the most robust to phylogeny, but that conservation is also quite robust. Next, we consider natural multiple sequence alignments of protein families for which deep mutational scan experimental data is available. We show that in this natural data, conservation and ICOD best identify sites with strong functional roles, in agreement with our results on synthetic data. Importantly, these two methods have different premises, since they focus on conservation and on correlations, respectively. Thus, their joint use can reveal complementary information.
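To make the contrast between the two premises concrete, here is a minimal sketch of a conservation score and a correlation-based score computed on the same alignment. It rests on several assumptions that the abstract does not state: sequences are encoded as binary states (0/1) per site, conservation is measured as maximum entropy minus column entropy, and the correlation-based score is taken as the leading eigenvector of the inverse covariance matrix with its diagonal set to zero, which is our reading of the ICOD idea and not a reproduction of the authors' implementation. The function names and the `pseudo` regularization parameter are hypothetical.

```python
import numpy as np


def site_conservation(msa, n_states=2):
    """Per-site conservation: log(n_states) minus the Shannon entropy of each column.

    Assumes `msa` is an integer array of shape (n_sequences, n_sites)
    with entries in {0, ..., n_states - 1}.
    """
    msa = np.asarray(msa)
    n_seq, n_sites = msa.shape
    cons = np.empty(n_sites)
    for j in range(n_sites):
        freqs = np.bincount(msa[:, j], minlength=n_states) / n_seq
        nonzero = freqs[freqs > 0]  # skip empty states to avoid log(0)
        entropy = -np.sum(nonzero * np.log(nonzero))
        cons[j] = np.log(n_states) - entropy
    return cons


def icod_like_scores(msa, pseudo=1e-3):
    """Correlation-based site scores (an ICOD-like sketch, not the published code).

    Steps (assumed): covariance of sites, small ridge term so the matrix
    is invertible, inversion, diagonal set to zero, then the eigenvector
    with the largest-magnitude eigenvalue.
    """
    msa = np.asarray(msa, dtype=float)
    cov = np.cov(msa, rowvar=False)
    cov = cov + pseudo * np.eye(cov.shape[0])  # hypothetical regularization
    inv = np.linalg.inv(cov)
    np.fill_diagonal(inv, 0.0)  # discard the diagonal, which reflects single-site variability
    eigvals, eigvecs = np.linalg.eigh(inv)  # inv is symmetric
    top = eigvecs[:, np.argmax(np.abs(eigvals))]
    return np.abs(top)  # unit-norm vector; large entries flag candidate sector sites


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_msa = rng.integers(0, 2, size=(200, 8))  # random toy alignment, no real sector
    print(site_conservation(toy_msa))
    print(icod_like_scores(toy_msa))
```

The two scores use different information: the first looks only at single-column frequencies, while the second depends only on pairwise covariances. This is why, under these assumptions, they can rank sites differently on the same alignment.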

Source journal

PLoS Computational Biology BIOCHEMICAL RESEARCH METHODS-MATHEMATICAL & COMPUTATIONAL BIOLOGY

CiteScore: 7.10

Self-citation rate: 4.70%

Articles published: 820

Review time: 2.5 months

About the journal: PLOS Computational Biology features works of exceptional significance that further our understanding of living systems at all scales, from molecules and cells to patient populations and ecosystems, through the application of computational methods. Readers include life and computational scientists, who can take the important findings presented here to the next level of discovery. Research articles must be declared as belonging to a relevant section. More information about the sections can be found in the submission guidelines. Research articles should model aspects of biological systems, demonstrate both methodological and scientific novelty, and provide profound new biological insights. Generally, reliability and significance of biological discovery through computation should be validated and enriched by experimental studies. Inclusion of experimental validation is not required for publication, but should be referenced where possible. Inclusion of experimental validation of a modest biological discovery through computation does not render a manuscript suitable for PLOS Computational Biology. Research articles specifically designated as Methods papers should describe outstanding methods of exceptional importance that have been shown, or have the promise, to provide new biological insights. The method must already be widely adopted, or have the promise of wide adoption by a broad community of users. Enhancements to existing published methods will only be considered if those enhancements bring exceptional new capabilities.