{"title":"ProtMamba:一个同源感知但无比对的蛋白质状态空间模型。","authors":"Damiano Sgarbossa, Cyril Malbranke, Anne-Florence Bitbol","doi":"10.1093/bioinformatics/btaf348","DOIUrl":null,"url":null,"abstract":"<p><strong>Motivation: </strong>Protein language models are enabling advances in elucidating the sequence-to-function mapping, and have important applications in protein design. Models based on multiple sequence alignments efficiently capture the evolutionary information in homologous protein sequences, but multiple sequence alignment construction is imperfect.</p><p><strong>Results: </strong>We present ProtMamba, a homology-aware but alignment-free protein language model based on the Mamba architecture. In contrast with attention-based models, ProtMamba efficiently handles very long context, comprising hundreds of protein sequences. It is also computationally efficient. We train ProtMamba on a large dataset of concatenated homologous sequences, using two GPUs. We combine autoregressive modeling and masked language modeling through a fill-in-the-middle training objective. This makes the model adapted to various protein design applications. We demonstrate ProtMamba's usefulness for sequence generation, motif inpainting, fitness prediction, and modeling intrinsically disordered regions. For homolog-conditioned sequence generation, ProtMamba outperforms state-of-the-art models. ProtMamba's competitive performance, despite its relatively small size, sheds light on the importance of long-context conditioning.</p><p><strong>Availability: </strong>A Python implementation of ProtMamba is freely available in our GitHub repository: https://github.com/Bitbol-Lab/ProtMamba-ssm and archived at https://doi.org/10.5281/zenodo.15584634.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ProtMamba: a homology-aware but alignment-free protein state space model.\",\"authors\":\"Damiano Sgarbossa, Cyril Malbranke, Anne-Florence Bitbol\",\"doi\":\"10.1093/bioinformatics/btaf348\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Motivation: </strong>Protein language models are enabling advances in elucidating the sequence-to-function mapping, and have important applications in protein design. Models based on multiple sequence alignments efficiently capture the evolutionary information in homologous protein sequences, but multiple sequence alignment construction is imperfect.</p><p><strong>Results: </strong>We present ProtMamba, a homology-aware but alignment-free protein language model based on the Mamba architecture. In contrast with attention-based models, ProtMamba efficiently handles very long context, comprising hundreds of protein sequences. It is also computationally efficient. We train ProtMamba on a large dataset of concatenated homologous sequences, using two GPUs. We combine autoregressive modeling and masked language modeling through a fill-in-the-middle training objective. This makes the model adapted to various protein design applications. We demonstrate ProtMamba's usefulness for sequence generation, motif inpainting, fitness prediction, and modeling intrinsically disordered regions. 
For homolog-conditioned sequence generation, ProtMamba outperforms state-of-the-art models. ProtMamba's competitive performance, despite its relatively small size, sheds light on the importance of long-context conditioning.</p><p><strong>Availability: </strong>A Python implementation of ProtMamba is freely available in our GitHub repository: https://github.com/Bitbol-Lab/ProtMamba-ssm and archived at https://doi.org/10.5281/zenodo.15584634.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>\",\"PeriodicalId\":93899,\"journal\":{\"name\":\"Bioinformatics (Oxford, England)\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-06-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Bioinformatics (Oxford, England)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/bioinformatics/btaf348\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics (Oxford, England)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioinformatics/btaf348","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
ProtMamba: a homology-aware but alignment-free protein state space model.
Motivation: Protein language models are enabling advances in elucidating the sequence-to-function mapping, and have important applications in protein design. Models based on multiple sequence alignments efficiently capture the evolutionary information in homologous protein sequences, but multiple sequence alignment construction is imperfect.
Results: We present ProtMamba, a homology-aware but alignment-free protein language model based on the Mamba architecture. In contrast with attention-based models, ProtMamba efficiently handles very long contexts comprising hundreds of protein sequences, and it is computationally efficient. We train ProtMamba on a large dataset of concatenated homologous sequences, using two GPUs. We combine autoregressive modeling and masked language modeling through a fill-in-the-middle training objective, which makes the model well suited to various protein design applications. We demonstrate ProtMamba's usefulness for sequence generation, motif inpainting, fitness prediction, and modeling intrinsically disordered regions. For homolog-conditioned sequence generation, ProtMamba outperforms state-of-the-art models. ProtMamba's competitive performance, despite its relatively small size, sheds light on the importance of long-context conditioning.
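The fill-in-the-middle objective mentioned above can be illustrated with a small preprocessing sketch: homologous sequences are concatenated into one long context, a span is cut out, and the span is moved to the end after a sentinel token, so that a purely autoregressive model learns to reconstruct the missing middle conditioned on both its prefix and suffix. The sentinel token names, separator, and concatenation scheme below are illustrative assumptions, not the exact ProtMamba implementation.

```python
import random

# Hypothetical sentinel tokens; the actual ProtMamba vocabulary may differ.
SEQ_SEP = "<eos>"            # separator between concatenated homologous sequences
FIM_SUFFIX = "<fim_suffix>"  # marks where the removed span used to be
FIM_MIDDLE = "<fim_middle>"  # precedes the span the model must generate

def build_fim_example(homologs, span_len=10, rng=random):
    """Concatenate homologous sequences into one long context and move a
    randomly chosen span to the end (fill-in-the-middle transformation).

    Training an autoregressive model on the returned string means that
    predicting the tokens after FIM_MIDDLE amounts to inpainting the
    masked span, conditioned on both the prefix and the suffix."""
    context = SEQ_SEP.join(homologs)
    start = rng.randrange(0, max(1, len(context) - span_len))
    prefix = context[:start]
    middle = context[start:start + span_len]
    suffix = context[start + span_len:]
    return prefix + FIM_SUFFIX + suffix + FIM_MIDDLE + middle

# Toy example with three short, made-up "homologous" sequences.
print(build_fim_example(["MKTAYIAKQR", "MKSAYLAKQR", "MKTVYIAKHR"], span_len=5))
```

Under this kind of transformation, a single causal model serves both standard autoregressive generation (no span removed) and masked inpainting (span moved to the end), which is one way to read the abstract's statement that autoregressive and masked language modeling are combined.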
Availability: A Python implementation of ProtMamba is freely available in our GitHub repository: https://github.com/Bitbol-Lab/ProtMamba-ssm and archived at https://doi.org/10.5281/zenodo.15584634.
Supplementary information: Supplementary data are available at Bioinformatics online.