S-PLM: Structure-Aware Protein Language Model via Contrastive Learning Between Sequence and Structure

Impact Factor: 14.3 · CAS Tier 1 (Materials Science) · JCR Q1 (Chemistry, Multidisciplinary)
Duolin Wang, Mahdi Pourmirzaei, Usman L. Abbas, Shuai Zeng, Negin Manshour, Farzaneh Esmaili, Biplab Poudel, Yuexu Jiang, Qing Shao, Jin Chen, Dong Xu
{"title":"S-PLM: Structure-Aware Protein Language Model via Contrastive Learning Between Sequence and Structure","authors":"Duolin Wang,&nbsp;Mahdi Pourmirzaei,&nbsp;Usman L. Abbas,&nbsp;Shuai Zeng,&nbsp;Negin Manshour,&nbsp;Farzaneh Esmaili,&nbsp;Biplab Poudel,&nbsp;Yuexu Jiang,&nbsp;Qing Shao,&nbsp;Jin Chen,&nbsp;Dong Xu","doi":"10.1002/advs.202404212","DOIUrl":null,"url":null,"abstract":"<p>Proteins play an essential role in various biological and engineering processes. Large protein language models (PLMs) present excellent potential to reshape protein research by accelerating the determination of protein functions and the design of proteins with the desired functions. The prediction and design capacity of PLMs relies on the representation gained from the protein sequences. However, the lack of crucial 3D structure information in most PLMs restricts the prediction capacity of PLMs in various applications, especially those heavily dependent on 3D structures. To address this issue, S-PLM is introduced as a 3D structure-aware PLM that utilizes multi-view contrastive learning to align the sequence and 3D structure of a protein in a coordinated latent space. S-PLM applies Swin-Transformer on AlphaFold-predicted protein structures to embed the structural information and fuses it into sequence-based embedding from ESM2. Additionally, a library of lightweight tuning tools is provided to adapt S-PLM for diverse downstream protein prediction tasks. The results demonstrate S-PLM's superior performance over sequence-only PLMs on all protein clustering and classification tasks, achieving competitiveness comparable to state-of-the-art methods requiring both sequence and structure inputs. S-PLM and its lightweight tuning tools are available at https://github.com/duolinwang/S-PLM/.</p>","PeriodicalId":117,"journal":{"name":"Advanced Science","volume":"12 5","pages":""},"PeriodicalIF":14.3000,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/advs.202404212","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advanced Science","FirstCategoryId":"88","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/advs.202404212","RegionNum":1,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
Citations: 0

Abstract

Proteins play an essential role in various biological and engineering processes. Large protein language models (PLMs) have great potential to reshape protein research by accelerating the determination of protein functions and the design of proteins with desired functions. The prediction and design capacity of PLMs relies on the representations learned from protein sequences. However, most PLMs lack crucial 3D structure information, which restricts their predictive capacity in applications that depend heavily on 3D structure. To address this issue, S-PLM is introduced as a 3D structure-aware PLM that uses multi-view contrastive learning to align the sequence and 3D structure of a protein in a coordinated latent space. S-PLM applies a Swin-Transformer to AlphaFold-predicted protein structures to embed the structural information and fuses it with the sequence-based embedding from ESM2. Additionally, a library of lightweight tuning tools is provided to adapt S-PLM to diverse downstream protein prediction tasks. The results demonstrate S-PLM's superior performance over sequence-only PLMs on all protein clustering and classification tasks, achieving performance comparable to state-of-the-art methods that require both sequence and structure inputs. S-PLM and its lightweight tuning tools are available at https://github.com/duolinwang/S-PLM/.
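The abstract describes two technical components: a multi-view contrastive objective that aligns sequence embeddings (from ESM2) with structure embeddings (from a Swin-Transformer over AlphaFold-predicted structures), and a library of lightweight tuning tools for downstream tasks. The PyTorch sketch below illustrates both ideas under stated assumptions; it is not the authors' released implementation (see the GitHub repository for that). The CLIP-style symmetric InfoNCE loss, the pooled embedding shapes, and the names contrastive_alignment_loss and FrozenBackboneHead are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def contrastive_alignment_loss(seq_emb, struct_emb, temperature=0.07):
        # seq_emb:    (B, D) pooled sequence embeddings (e.g., from ESM2)
        # struct_emb: (B, D) pooled structure embeddings (e.g., from a
        #             Swin-Transformer over an image-like encoding of
        #             AlphaFold-predicted structures)
        seq_emb = F.normalize(seq_emb, dim=-1)
        struct_emb = F.normalize(struct_emb, dim=-1)
        # Cosine-similarity matrix; entry (i, i) is the matched pair.
        logits = seq_emb @ struct_emb.t() / temperature
        targets = torch.arange(seq_emb.size(0), device=seq_emb.device)
        # Symmetric InfoNCE: classify the true partner in both directions.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    class FrozenBackboneHead(torch.nn.Module):
        # One common form of lightweight tuning: freeze the pretrained
        # encoder and train only a small task head, so downstream
        # adaptation updates very few parameters.
        def __init__(self, encoder, emb_dim, n_classes):
            super().__init__()
            self.encoder = encoder
            for p in self.encoder.parameters():
                p.requires_grad = False  # keep the backbone fixed
            self.head = torch.nn.Linear(emb_dim, n_classes)

        def forward(self, x):
            with torch.no_grad():
                emb = self.encoder(x)  # (B, emb_dim) pooled representation
            return self.head(emb)      # (B, n_classes) task logits

As a usage sketch, contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256)) returns a scalar loss for a batch of eight hypothetical sequence/structure pairs; minimizing it pulls each protein's two views together in the shared latent space while pushing apart mismatched pairs. FrozenBackboneHead(encoder, 256, n_classes) then adapts such a frozen backbone to a classification task by training only the linear head.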

Source Journal

Advanced Science
Categories: CHEMISTRY, MULTIDISCIPLINARY; NANOSCIENCE & NANOTECHNOLOGY
CiteScore: 18.90
Self-citation rate: 2.60%
Articles per year: 1602
Average review time: 1.9 months
Journal description: Advanced Science is a prestigious open access journal that focuses on interdisciplinary research in materials science, physics, chemistry, medical and life sciences, and engineering. The journal aims to promote cutting-edge research by employing a rigorous and impartial review process. It is committed to presenting research articles with the highest quality production standards, ensuring maximum accessibility of top scientific findings. With its vibrant and innovative publication platform, Advanced Science seeks to revolutionize the dissemination and organization of scientific knowledge.