Structure-guided sequence representation learning for generalizable protein function prediction.

Impact factor: 5.4

Bioinformatics (Oxford, England), Pub Date: 2025-09-01, DOI: 10.1093/bioinformatics/btaf511

SeokJun On, Yujin Jeong, Eun-Sol Kim

Citations: 0

Abstract

Motivation: Accurately predicting protein function from sequence remains a fundamental yet challenging goal in computational biology. Although recent advances allow protein 3D structures to be predicted reliably from sequence, using structural information alone for functional inference has had limited success. To address this gap, previous work has integrated sequence and structural data by representing proteins as graphs in which residues are nodes and spatial proximity defines edges. However, because the number of amino acids varies widely between proteins, these residue-level graphs also differ greatly in size, which makes it difficult to extract generalizable information from them accurately. In this work, we propose Structure-guided Sequence Representation Learning (S2RL), a novel framework that incorporates structural knowledge to extract informative, multiscale features directly from protein sequences. By embedding structural information into a sequence-based learning paradigm, our method captures functionally meaningful representations more effectively. Furthermore, we present a generalizable model architecture designed for multitask learning and inference, offering improved performance and flexibility over traditional task-specific approaches to protein function prediction.
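The graph representation described above (residues as nodes, spatial proximity as edges) can be illustrated with a minimal sketch. It is not taken from the S2RL code: the function name `contact_graph`, the use of one C-alpha coordinate per residue, the 8 Å cutoff, and the toy straight-line coordinates are all assumptions made for illustration. It shows how the graph size follows the sequence length.

```python
import numpy as np

def contact_graph(coords, cutoff=8.0):
    """Boolean adjacency matrix linking residues closer than `cutoff` angstroms.

    `coords` is an (n_residues, 3) array with one point per residue
    (assumed here to be the C-alpha atom). The cutoff is an assumed value.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise displacement vectors
    dist = np.linalg.norm(diff, axis=-1)             # (n, n) distance matrix
    adj = dist < cutoff
    np.fill_diagonal(adj, False)                     # no self-loops
    return adj

# Toy "proteins": residues placed on a straight line, 3.8 angstroms apart.
short = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
long = np.array([[3.8 * i, 0.0, 0.0] for i in range(50)])

adj_short = contact_graph(short)
adj_long = contact_graph(long)

# Node and edge counts grow with sequence length.
print(adj_short.shape, int(adj_short.sum()) // 2)
print(adj_long.shape, int(adj_long.sum()) // 2)
```

In this toy setting the 5-residue chain yields a 5-node graph and the 50-residue chain a 50-node graph, which is the size variation the abstract identifies as an obstacle to learning generalizable features.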

Results: We demonstrate that the proposed attention pooling method on protein graphs effectively integrates global structural features and local chemical properties of amino acids in proteins of varying length. This approach improves performance on tasks related to predicting protein functions, functional expression sites, and their relationships with structure and sequence. By extracting the information needed to predict multiple protein functions simultaneously, it also improves efficiency by eliminating the need to train separately for each task.
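The abstract does not specify how the attention pooling is built, so the following is only a generic sketch of the idea, not the authors' method. It assumes a fixed set of query vectors (random stand-ins for learned parameters), scaled dot-product attention over residue features, and two hypothetical linear heads named `task_a` and `task_b`. It shows how proteins of different lengths can be reduced to a representation of the same size and then shared across several prediction tasks.

```python
import numpy as np

def attention_pool(node_feats, queries):
    """Pool (n_residues, d) features into (n_queries, d) via softmax attention."""
    d = node_feats.shape[1]
    scores = queries @ node_feats.T / np.sqrt(d)      # (k, n) attention scores
    scores -= scores.max(axis=1, keepdims=True)       # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)     # each row sums to 1 over residues
    return weights @ node_feats                       # (k, d), independent of n

rng = np.random.default_rng(0)
d, k = 16, 4
queries = rng.normal(size=(k, d))   # stand-in for learned query vectors

# Two proteins of very different length give pooled outputs of the same shape.
pooled_short = attention_pool(rng.normal(size=(30, d)), queries)
pooled_long = attention_pool(rng.normal(size=(700, d)), queries)
print(pooled_short.shape, pooled_long.shape)

# Hypothetical multitask heads sharing one pooled representation.
heads = {"task_a": rng.normal(size=(k * d, 3)),
         "task_b": rng.normal(size=(k * d, 5))}
logits = {name: pooled_long.reshape(-1) @ w for name, w in heads.items()}
print({name: out.shape for name, out in logits.items()})
```

Because the pooled output depends only on the number of queries and the feature width, the same task heads can be applied to any protein, which is one way a single model could serve several tasks without separate training.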

Availability and implementation: The code is available at https://github.com/vanha9/S2RL_protein and is archived on Zenodo: https://doi.org/10.5281/zenodo.16441001.
