Sagara N S Gurusinghe, Yibing Wu, William DeGrado, Julia M Shifman
Bioinformatics (Oxford, England), published 2025-05-06. DOI: 10.1093/bioinformatics/btaf270 (https://doi.org/10.1093/bioinformatics/btaf270). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12151015/pdf/
Abstract
ProBASS-a language model with sequence and structural features for predicting the effect of mutations on binding affinity.
Motivation: Protein-protein interactions (PPIs) govern virtually all cellular processes, and a single mutation within a PPI can significantly impact protein functionality, potentially leading to diseases. While numerous approaches have emerged to predict changes in the free energy of binding due to mutations (ΔΔGbind), most lack precision. Recently, protein language models (PLMs) have shown powerful predictive capabilities by leveraging both sequence and structural data from protein complexes, yet they have not been optimized specifically for ΔΔGbind prediction.
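For reference, the quantity being predicted can be written out explicitly. The abstract does not state a sign convention, so the form below assumes the common one (mutant minus wild type, so that a positive value means weaker binding); the symbols $K_d$, $R$, and $T$ are standard thermodynamic notation rather than names taken from the paper.

```latex
\[
\Delta G_{\mathrm{bind}} = RT \ln K_d ,
\qquad
\Delta\Delta G_{\mathrm{bind}}
  = \Delta G_{\mathrm{bind}}^{\mathrm{mut}} - \Delta G_{\mathrm{bind}}^{\mathrm{wt}}
  = RT \ln \frac{K_d^{\mathrm{mut}}}{K_d^{\mathrm{wt}}}
\]
```

Here $K_d$ is the dissociation constant of the complex (relative to a 1 M standard state), $R$ is the gas constant, and $T$ is the absolute temperature. Under this convention, a mutation that raises $K_d$ tenfold at 298 K corresponds to a ΔΔGbind of roughly +1.4 kcal/mol.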
Results: We developed an approach, ProBASS (Protein Binding Affinity from Structure and Sequence), to predict the effects of mutations on ΔΔGbind using two of the most advanced PLMs, ESM2 and ESM-IF1, which incorporate sequence and structural features, respectively. We first generated embeddings for each PPI mutant from the two PLMs and then fine-tuned ProBASS by training on a large dataset of experimental ΔΔGbind values. When training and testing were done on the same PPI, ProBASS achieved correlations with experimental ΔΔGbind values of 0.83 ± 0.05 and 0.69 ± 0.04 for single and double mutations, respectively. Additionally, when evaluated on a dataset of 2,325 single mutations across 131 PPIs, ProBASS reached a correlation of 0.81 ± 0.02, substantially outperforming other PLMs in predictive accuracy. Our results demonstrate that refining pre-trained PLMs with extensive ΔΔGbind datasets across multiple PPIs is a successful approach for creating a precise and broadly applicable ΔΔGbind prediction model, facilitating future protein engineering and design studies. ProBASS's accuracy could be further improved through training as more experimental data become available.
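The general workflow described above (two embeddings per mutant, combined and regressed against experimental ΔΔGbind, then scored by correlation) can be illustrated with a minimal sketch. This is not the ProBASS implementation: the abstract does not describe the model architecture, so a closed-form ridge regression stands in for the trained model, the embeddings are random placeholders rather than real ESM2 or ESM-IF1 outputs, the dimensions are arbitrary, and Pearson correlation is assumed because the abstract says only "correlation". All function names are hypothetical.

```python
import numpy as np


def combine_embeddings(seq_emb, struct_emb):
    """Concatenate per-mutant sequence and structure embeddings feature-wise."""
    return np.concatenate([seq_emb, struct_emb], axis=1)


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression (a stand-in for the real model).

    Returns the weight vector and the intercept.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def predict(X, w, b):
    """Predict ddG values from combined embeddings."""
    return X @ w + b


def pearson_r(a, b):
    """Pearson correlation between predicted and reference values."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic placeholder data: NOT real embeddings or measurements.
rng = np.random.default_rng(0)
n_mutants, d_seq, d_struct = 200, 32, 16  # arbitrary illustrative sizes
seq_emb = rng.normal(size=(n_mutants, d_seq))        # stands in for ESM2
struct_emb = rng.normal(size=(n_mutants, d_struct))  # stands in for ESM-IF1
X = combine_embeddings(seq_emb, struct_emb)

# Fake "experimental" ddG values: a linear signal plus noise.
true_w = rng.normal(size=d_seq + d_struct)
ddg = X @ true_w + rng.normal(scale=0.5, size=n_mutants)

# Simple hold-out split: fit on the first 150 mutants, score on the rest.
train, test = slice(0, 150), slice(150, 200)
w, b = fit_ridge(X[train], ddg[train])
r = pearson_r(predict(X[test], w, b), ddg[test])
print(f"held-out Pearson r = {r:.2f}")
```

Because the synthetic targets are linear in the features by construction, the correlation printed here says nothing about real predictive accuracy; the sketch only shows how the pieces fit together. Note also that a split by mutant, as used here, resembles the "same PPI" setting in the abstract, whereas evaluating across PPIs would require holding out whole complexes.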
Availability and implementation: ProBASS is available at https://colab.research.google.com/github/sagagugit/ProBASS/blob/main/ProBASS.ipynb.
Supplementary information: Supplementary data are available at Bioinformatics online.