Extending Protein Language Models to a Viral Genomic Scale Using Biologically Induced Sparse Attention

Thibaut Dejean, Barbra D Ferrell, William Harrigan, Zachary D Schreiber, Rajan Sawhney, K Eric Wommack, Shawn W Polson, Mahdi Belcaid

bioRxiv: the preprint server for biology. Posted 2025-06-11. doi: 10.1101/2025.05.29.656907. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12154925/pdf/
The transformer architecture in deep learning has revolutionized protein sequence analysis. Recent advances in protein language models have enabled significant progress across various domains, including protein function and structure prediction, multiple sequence alignment, and mutation-effect prediction. Protein language models are commonly trained on individual proteins, ignoring the interdependencies between sequences within a genome. However, biological evidence shows that protein-protein interactions span entire genomic regions, underscoring the limitations of focusing solely on individual proteins. To address these limitations, we propose a novel approach that extends the context size of transformer models to the entire viral genome. By training on large genomic fragments, our method captures long-range inter-protein interactions and encodes each protein sequence with information integrated from distant proteins within the same genome, offering substantial benefits across downstream tasks. Viruses, with their densely packed genomes, minimal intergenic regions, and protein-annotation challenges, are ideal candidates for genome-wide learning. We introduce a long-context protein language model, trained on entire viral genomes, that leverages a sparse attention mechanism based on protein-protein interactions. Our semi-supervised approach supports sequences of up to 61,000 amino acids (aa). Our evaluations demonstrate that the resulting embeddings significantly surpass those generated by single-protein models and outperform alternative large-context architectures that rely on static masking or non-transformer frameworks.
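To make the idea of biologically induced sparse attention concrete, the sketch below shows one plausible way such a mechanism could be wired up: attention over a concatenated genome-scale sequence is restricted to residue pairs within the same protein or within protein pairs flagged as interacting. This is a minimal illustration under our own assumptions, not the paper's implementation; the function names, the dense mask materialization, and the toy protein-protein interaction set are all hypothetical.

```python
import torch

def interaction_sparse_mask(protein_spans, interactions, seq_len):
    """Build a boolean attention mask for a genome-scale sequence.

    protein_spans : list of (start, end) residue ranges, one per protein
                    in the concatenated genomic sequence (hypothetical layout).
    interactions  : set of (i, j) protein-index pairs allowed to attend to
                    each other, e.g. from a protein-protein interaction map.
    seq_len       : total length of the concatenated amino-acid sequence.

    Returns a (seq_len, seq_len) boolean tensor where True marks allowed
    attention; residues within the same protein always attend to each other.
    """
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i, (si, ei) in enumerate(protein_spans):
        mask[si:ei, si:ei] = True                      # intra-protein attention
        for j, (sj, ej) in enumerate(protein_spans):
            if (i, j) in interactions or (j, i) in interactions:
                mask[si:ei, sj:ej] = True              # inter-protein attention
    return mask

def sparse_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the allowed positions.

    q, k, v : (seq_len, d) tensors; mask : (seq_len, seq_len) boolean.
    Written densely for clarity; a real 61,000-aa context would need a
    block-sparse kernel rather than materializing the full score matrix.
    """
    scores = q @ k.T / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy example: three proteins of 4 residues each; proteins 0 and 2 interact.
spans = [(0, 4), (4, 8), (8, 12)]
mask = interaction_sparse_mask(spans, {(0, 2)}, seq_len=12)
q = k = v = torch.randn(12, 16)
out = sparse_attention(q, k, v, mask)
print(out.shape)  # torch.Size([12, 16])
```

The design point this sketch captures is that sparsity follows biology rather than a fixed local window: attention cost grows with the number of interacting protein pairs instead of with the square of the genome length, which is what makes genome-scale contexts tractable.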