MULAN: multimodal protein language model for sequence and structure encoding.

Daria Frolova, Marina Pak, Anna Litvin, Ilya Sharov, Dmitry Ivankov, Ivan Oseledets
Motivation: Most protein language models (PLMs) produce high-quality representations using only protein sequences. However, incorporating known protein structures is important for many prediction tasks, which has driven growing interest in structure-aware PLMs. Existing structure-aware PLMs are either trained from scratch or incur significant parameter overhead from their structure encoders.
Results: In this study, we propose MULAN, a MULtimodal PLM for both sequence and ANgle-based structure encoding. MULAN combines a pre-trained sequence encoder with a newly introduced, parameter-efficient Structure Adapter; the two are fused and trained together. Evaluated on nine downstream tasks, MULAN models of various sizes outperform both the sequence-only ESM2 and the structure-aware SaProt. The largest gains appear in protein-protein interaction prediction (up to 0.12 AUROC). Importantly, unlike other models, MULAN increases the structural awareness of protein representations cheaply, by fine-tuning an existing PLM rather than training from scratch. We analyze the proposed model in detail and demonstrate its awareness of protein structure.
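To make the general idea concrete, below is a minimal PyTorch sketch of an angle-based structure adapter fused with precomputed per-residue embeddings from a pre-trained sequence encoder. This is an illustrative assumption, not the authors' implementation: the layer sizes, the sin/cos angle featurization, the additive fusion rule, and the class names (StructureAdapter, MultimodalPLM) are all hypothetical; the actual architecture and training code are in the linked repository.

```python
# Illustrative sketch only (assumptions noted above), not the MULAN codebase.
import torch
import torch.nn as nn


class StructureAdapter(nn.Module):
    """Maps backbone dihedral angles into the sequence-embedding space.

    Angles are encoded as (sin, cos) pairs so that the 2*pi periodicity
    does not introduce a discontinuity at the domain boundary.
    """

    def __init__(self, n_angles: int = 3, d_model: int = 320, d_hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_angles, d_hidden),  # sin/cos features per angle
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, angles: torch.Tensor) -> torch.Tensor:
        # angles: (batch, seq_len, n_angles) in radians, e.g. phi/psi/omega
        feats = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return self.mlp(feats)  # (batch, seq_len, d_model)


class MultimodalPLM(nn.Module):
    """Adds the adapter output to sequence embeddings, then refines the fusion."""

    def __init__(self, d_model: int = 320, n_fusion_layers: int = 2):
        super().__init__()
        self.adapter = StructureAdapter(d_model=d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_fusion_layers)

    def forward(self, seq_emb: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
        # seq_emb: per-residue embeddings from a pre-trained PLM such as ESM2,
        # assumed precomputed here so only the small adapter/fusion stack trains.
        fused = seq_emb + self.adapter(angles)  # additive fusion (assumption)
        return self.fusion(fused)


if __name__ == "__main__":
    batch, seq_len, d_model = 2, 50, 320
    seq_emb = torch.randn(batch, seq_len, d_model)        # stand-in for ESM2 output
    angles = torch.rand(batch, seq_len, 3) * 2 * torch.pi - torch.pi
    out = MultimodalPLM(d_model=d_model)(seq_emb, angles)
    print(out.shape)  # torch.Size([2, 50, 320])
```

The point of such a design is the parameter economy: the large pre-trained encoder is reused as-is, and only the small adapter and fusion layers carry new parameters, which is what makes the added structural awareness cheap relative to training a structure-aware PLM from scratch.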
Availability and implementation: The implementation, training data, and model checkpoints are available at https://github.com/DFrolova/MULAN.