超快速经典系统发育方法在变异效应预测上优于大型蛋白质语言模型。

Advances in neural information processing systems Pub Date : 2024-01-01

Sebastian Prillo, Wilson Wu, Yun S Song

{"title":"超快速经典系统发育方法在变异效应预测上优于大型蛋白质语言模型。","authors":"Sebastian Prillo, Wilson Wu, Yun S Song","doi":"","DOIUrl":null,"url":null,"abstract":"Amino acid substitution rate matrices are fundamental to statistical phylogenetics and evolutionary biology. Estimating them typically requires reconstructed trees for massive amounts of aligned proteins, which poses a major computational bottleneck. In this paper, we develop a near-linear time method to estimate these rate matrices from multiple sequence alignments (MSAs) alone, thereby speeding up computation by orders of magnitude. Our method relies on a near-linear time cherry reconstruction algorithm which we call FastCherries and it can be easily applied to MSAs with millions of sequences. On both simulated and real data, we demonstrate the speed and accuracy of our method as applied to the classical model of protein evolution. By leveraging the unprecedented scalability of our method, we develop a new, rich phylogenetic model called SiteRM, which can estimate a general site-specific rate matrix for each column of an MSA. Remarkably, in variant effect prediction for both clinical and deep mutational scanning data in ProteinGym, we show that despite being an independent-sites model, our SiteRM model outperforms large protein language models that learn complex residue-residue interactions between different sites. We attribute our increased performance to conceptual advances in our probabilistic treatment of evolutionary data and our ability to handle extremely large MSAs. We anticipate that our work will have a lasting impact across both statistical phylogenetics and computational variant effect prediction. FastCherries and SiteRM are implemented in the CherryML package https://github.com/songlab-cal/CherryML.","PeriodicalId":72099,"journal":{"name":"Advances in neural information processing systems","volume":"37 ","pages":"130265-130290"},"PeriodicalIF":0.0000,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12143485/pdf/","citationCount":"0","resultStr":"{\"title\":\"Ultrafast classical phylogenetic method beats large protein language models on variant effect prediction.\",\"authors\":\"Sebastian Prillo, Wilson Wu, Yun S Song\",\"doi\":\"\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Amino acid substitution rate matrices are fundamental to statistical phylogenetics and evolutionary biology. Estimating them typically requires reconstructed trees for massive amounts of aligned proteins, which poses a major computational bottleneck. In this paper, we develop a near-linear time method to estimate these rate matrices from multiple sequence alignments (MSAs) alone, thereby speeding up computation by orders of magnitude. Our method relies on a near-linear time cherry reconstruction algorithm which we call FastCherries and it can be easily applied to MSAs with millions of sequences. On both simulated and real data, we demonstrate the speed and accuracy of our method as applied to the classical model of protein evolution. By leveraging the unprecedented scalability of our method, we develop a new, rich phylogenetic model called SiteRM, which can estimate a general site-specific rate matrix for each column of an MSA. Remarkably, in variant effect prediction for both clinical and deep mutational scanning data in ProteinGym, we show that despite being an independent-sites model, our SiteRM model outperforms large protein language models that learn complex residue-residue interactions between different sites. We attribute our increased performance to conceptual advances in our probabilistic treatment of evolutionary data and our ability to handle extremely large MSAs. We anticipate that our work will have a lasting impact across both statistical phylogenetics and computational variant effect prediction. FastCherries and SiteRM are implemented in the CherryML package https://github.com/songlab-cal/CherryML.\",\"PeriodicalId\":72099,\"journal\":{\"name\":\"Advances in neural information processing systems\",\"volume\":\"37 \",\"pages\":\"130265-130290\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12143485/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Advances in neural information processing systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in neural information processing systems","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

氨基酸取代率矩阵是统计系统发育和进化生物学的基础。估计它们通常需要为大量排列的蛋白质重建树，这构成了一个主要的计算瓶颈。在本文中，我们开发了一种近线性时间方法来估计这些速率矩阵，从而将计算速度提高了几个数量级。我们的方法依赖于一种近似线性的时间樱桃重建算法，我们称之为fastcherry，它可以很容易地应用于具有数百万序列的msa。在模拟和实际数据上，我们证明了我们的方法适用于经典的蛋白质进化模型的速度和准确性。通过利用我们的方法的前所未有的可扩展性，我们开发了一个新的，丰富的系统发育模型称为SiteRM，它可以估计一个一般的位点特定率矩阵的每一列的MSA。值得注意的是，在ProteinGym的临床和深度突变扫描数据的变异效应预测中，我们表明，尽管是一个独立的位点模型，但我们的SiteRM模型优于学习不同位点之间复杂残基-残基相互作用的大型蛋白质语言模型。我们将性能的提高归功于我们对进化数据的概率处理的概念进步，以及我们处理超大msa的能力。我们预计我们的工作将对统计系统发育和计算变异效应预测产生持久的影响。fastcherry和SiteRM在CherryML包https://github.com/songlab-cal/CherryML中实现。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

本刊更多论文

Ultrafast classical phylogenetic method beats large protein language models on variant effect prediction.

Amino acid substitution rate matrices are fundamental to statistical phylogenetics and evolutionary biology. Estimating them typically requires reconstructed trees for massive amounts of aligned proteins, which poses a major computational bottleneck. In this paper, we develop a near-linear time method to estimate these rate matrices from multiple sequence alignments (MSAs) alone, thereby speeding up computation by orders of magnitude. Our method relies on a near-linear time cherry reconstruction algorithm which we call FastCherries and it can be easily applied to MSAs with millions of sequences. On both simulated and real data, we demonstrate the speed and accuracy of our method as applied to the classical model of protein evolution. By leveraging the unprecedented scalability of our method, we develop a new, rich phylogenetic model called SiteRM, which can estimate a general site-specific rate matrix for each column of an MSA. Remarkably, in variant effect prediction for both clinical and deep mutational scanning data in ProteinGym, we show that despite being an independent-sites model, our SiteRM model outperforms large protein language models that learn complex residue-residue interactions between different sites. We attribute our increased performance to conceptual advances in our probabilistic treatment of evolutionary data and our ability to handle extremely large MSAs. We anticipate that our work will have a lasting impact across both statistical phylogenetics and computational variant effect prediction. FastCherries and SiteRM are implemented in the CherryML package https://github.com/songlab-cal/CherryML.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Advances in neural information processing systems

自引率

0.00%

发文量