Multimodal pretraining for unsupervised protein representation learning.

Biology Methods and Protocols · IF 2.5 · Q3 (Biochemical Research Methods)
Published: 2024-06-18 · eCollection date: 2024-01-01 · DOI: 10.1093/biomethods/bpae043
Viet Thanh Duy Nguyen, Truong Son Hy

Abstract

Proteins are complex biomolecules essential for numerous biological processes, making them crucial targets for advances in molecular biology, medical research, and drug design. Understanding their intricate, hierarchical structures and functions is vital for progress in these fields. To capture this complexity, we introduce Multimodal Protein Representation Learning (MPRL), a novel framework for symmetry-preserving multimodal pretraining that learns unified, unsupervised protein representations by integrating primary and tertiary structures. MPRL employs Evolutionary Scale Modeling (ESM-2) for sequence analysis, Variational Graph Auto-Encoders (VGAE) for residue-level graphs, and a PointNet Autoencoder (PAE) for 3D point clouds of atoms, each designed to capture the spatial and evolutionary intricacies of proteins while preserving critical symmetries. By leveraging Auto-Fusion to synthesize joint representations from these pretrained models, MPRL ensures robust and comprehensive protein representations. Our extensive evaluation demonstrates that MPRL significantly enhances performance on tasks such as protein-ligand binding affinity prediction, protein fold classification, enzyme activity identification, and mutation stability prediction. This framework advances the understanding of protein dynamics and facilitates future research in the field. Our source code is publicly available at https://github.com/HySonLab/Protein_Pretrain.
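The fusion step described above — combining pretrained per-modality embeddings into one joint representation — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation (see the linked repository for that): the embedding dimensions, the tanh nonlinearity, and the single linear encoder/decoder pair are assumptions chosen for brevity. The core idea shown is Auto-Fusion's autoencoder-style objective: concatenate the modality latents, project them into a joint space, and reconstruct the concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-modality embeddings for one protein (dimensions are illustrative):
#   - sequence embedding, e.g. from a language model such as ESM-2
#   - residue-graph embedding, e.g. a VGAE latent
#   - atom point-cloud embedding, e.g. a PointNet autoencoder latent
z_seq   = rng.normal(size=1280)
z_graph = rng.normal(size=256)
z_cloud = rng.normal(size=512)

def auto_fusion(latents, fused_dim=512, rng=rng):
    """Fuse modality latents: concatenate, project into a joint space,
    and reconstruct the concatenation. Training would minimize the
    reconstruction error; the weights here are random placeholders."""
    x = np.concatenate(latents)                       # (d_total,)
    W_enc = rng.normal(scale=0.01, size=(fused_dim, x.size))
    W_dec = rng.normal(scale=0.01, size=(x.size, fused_dim))
    fused = np.tanh(W_enc @ x)                        # joint representation
    recon = W_dec @ fused                             # reconstructed input
    loss = np.mean((recon - x) ** 2)                  # fusion objective
    return fused, loss

fused, loss = auto_fusion([z_seq, z_graph, z_cloud])
print(fused.shape)  # (512,)
```

Because the fused vector is trained only to reconstruct the concatenated inputs, the procedure stays unsupervised: no task labels are needed, and the joint representation can then be fed to any downstream predictor.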
