Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy

Impact factor 12, JCR Q1: Computer Science, Interdisciplinary Applications

Nature computational science. Pub Date: 2024-10-25. DOI: 10.1038/s43588-024-00716-2

Yunxin Xu, Di Liu, Haipeng Gong

Nature computational science, volume 4, issue 11, pages 840-850. https://www.nature.com/articles/s43588-024-00716-2

Citation count: 0

Abstract

Accurate prediction of protein mutation effects is of great importance in protein engineering and design. Here we propose GeoStab-suite, a suite of three geometric learning-based models (GeoFitness, GeoDDG and GeoDTm) for the prediction of the fitness score, ΔΔG and ΔTm of a protein upon mutations, respectively. GeoFitness uses a specialized loss function to allow supervised training of a unified model on the large amount of multi-labeled fitness data in the deep mutational scanning database. To further improve the downstream tasks of ΔΔG and ΔTm prediction, the encoder of GeoFitness is reused as a pre-trained module in GeoDDG and GeoDTm to overcome the lack of sufficient labeled data. This pre-training strategy, in combination with data expansion, markedly improves model performance and generalizability. In the benchmark test, GeoDDG and GeoDTm outperform the other state-of-the-art methods by at least 30% and 70%, respectively, in terms of the Spearman correlation coefficient.

In this study, the authors propose a strategy to train a unified model to learn general mutational effects from multi-labeled deep mutational scanning (DMS) data, and then reuse this pre-trained model to improve the downstream protein stability prediction tasks.
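The benchmark figures above are reported as Spearman correlation coefficients, which compare the rank order of predicted and measured values rather than their magnitudes. As a minimal sketch of how that metric can be computed (not the authors' evaluation code), the following uses only the Python standard library; the function names and the toy ΔΔG values are hypothetical and chosen purely for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical measured and predicted ΔΔG values (illustrative only).
measured = [-1.2, 0.3, 2.1, 0.8, -0.5]
predicted = [-0.9, 0.1, 1.5, 1.0, 0.2]
print(spearman(measured, predicted))  # prints 0.9 for this toy data
```

Because only ranks enter the calculation, a predictor that orders mutations correctly scores well even if its absolute ΔΔG or ΔTm values are miscalibrated.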


Source journal

Nature computational science

CiteScore

11.70

Self-citation rate

0.00%

Articles published