{"title":"Enhancing Structure-Aware Protein Language Models with Efficient Fine-Tuning for Various Protein Prediction Tasks.","authors":"Yichuan Zhang, Yongfang Qin, Mahdi Pourmirzaei, Qing Shao, Duolin Wang, Dong Xu","doi":"10.1007/978-1-0716-4623-6_2","DOIUrl":null,"url":null,"abstract":"<p><p>Proteins are crucial in a wide range of biological and engineering processes. Large protein language models (PLMs) can significantly advance our understanding and engineering of proteins. However, the effectiveness of PLMs in prediction and design is largely based on the representations derived from protein sequences. Without incorporating the three-dimensional (3D) structures of proteins, PLMs would overlook crucial aspects of how proteins interact with other molecules, thereby limiting their predictive accuracy. To address this issue, we present S-PLM, a 3D structure-aware PLM, that employs multi-view contrastive learning to align protein sequences with their 3D structures in a unified latent space. Previously, we utilized a contact map-based approach to encode structural information, applying the Swin-Transformer to contact maps derived from AlphaFold-predicted protein structures. This work introduces a new approach that leverages a geometric vector perceptron (GVP) model to process 3D coordinates and obtain structural embeddings. We focus on the application of structure-aware models for protein-related tasks by utilizing efficient fine-tuning methods to achieve optimal performance without significant computational costs. Our results show that S-PLM outperforms sequence-only PLMs across all protein clustering and classification tasks, achieving performance on par with state-of-the-art methods that require both sequence and structure inputs. S-PLM and its tuning tools are available at https://github.com/duolinwang/S-PLM/ .</p>","PeriodicalId":18490,"journal":{"name":"Methods in molecular biology","volume":"2941 ","pages":"31-58"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Methods in molecular biology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/978-1-0716-4623-6_2","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"Biochemistry, Genetics and Molecular Biology","Score":null,"Total":0}
Citations: 0
Abstract
Proteins are crucial in a wide range of biological and engineering processes. Large protein language models (PLMs) can significantly advance our understanding and engineering of proteins. However, the effectiveness of PLMs in prediction and design depends largely on representations derived from protein sequences. Without incorporating the three-dimensional (3D) structures of proteins, PLMs overlook crucial aspects of how proteins interact with other molecules, limiting their predictive accuracy. To address this issue, we present S-PLM, a 3D structure-aware PLM that employs multi-view contrastive learning to align protein sequences with their 3D structures in a unified latent space. Previously, we encoded structural information with a contact map-based approach, applying a Swin-Transformer to contact maps derived from AlphaFold-predicted protein structures. This work introduces a new approach that leverages a geometric vector perceptron (GVP) model to process 3D coordinates directly and obtain structural embeddings. We focus on applying structure-aware models to protein-related tasks, using efficient fine-tuning methods to achieve strong performance without significant computational cost. Our results show that S-PLM outperforms sequence-only PLMs across all protein clustering and classification tasks, achieving performance on par with state-of-the-art methods that require both sequence and structure inputs. S-PLM and its tuning tools are available at https://github.com/duolinwang/S-PLM/.
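The multi-view contrastive alignment described above can be illustrated with a short sketch. The following is a minimal, hypothetical CLIP-style symmetric InfoNCE loss between paired sequence and structure embeddings, written in PyTorch; the function name, batch layout, and temperature value are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of multi-view contrastive alignment between sequence and
# structure embeddings (CLIP-style symmetric InfoNCE). Assumed, not S-PLM's
# actual code: function name, temperature default, (batch, dim) layout.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(seq_emb: torch.Tensor,
                               struct_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Pull each sequence embedding toward the embedding of its own 3D
    structure and away from the other structures in the batch.
    Both inputs have shape (batch, dim)."""
    seq_emb = F.normalize(seq_emb, dim=-1)
    struct_emb = F.normalize(struct_emb, dim=-1)
    logits = seq_emb @ struct_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching sequence/structure pairs lie on the diagonal; contrast
    # in both directions (sequence->structure and structure->sequence).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```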
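"Efficient fine-tuning" typically refers to parameter-efficient methods that update only a small number of weights on top of a frozen pretrained model. Below is a minimal LoRA-style adapter sketch as one common example of this family; whether S-PLM's tuning tools use exactly this variant is an assumption here, and the class name and rank/alpha defaults are illustrative.

```python
# Minimal sketch of low-rank adapter (LoRA-style) fine-tuning, one common
# parameter-efficient method. Assumed for illustration, not S-PLM's actual
# implementation: class name, rank=8, alpha=16.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus a small trainable low-rank update.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```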
Journal Introduction:
For over 20 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is presented in readily reproducible, step-by-step fashion, opening with an introductory overview and a list of the materials and reagents needed to complete the experiment, followed by a detailed procedure supported by a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice.