{"title":"Structure-guided sequence representation learning for generalizable protein function prediction.","authors":"SeokJun On, Yujin Jeong, Eun-Sol Kim","doi":"10.1093/bioinformatics/btaf511","DOIUrl":null,"url":null,"abstract":"<p><strong>Motivation: </strong>Accurately predicting protein function from sequence remains a fundamental yet challenging goal in computational biology. Although recent advances have enabled the reliable prediction of protein 3D structures from sequences, utilizing structural information alone for functional inference has shown limited success. To address this gap, previous work has explored the integration of sequence and structural data by representing proteins as graphs, where residues are modeled as nodes, and spatial proximity defines edges. However, since the number of amino acids can vary significantly between proteins, the resulting graphs, constructed based on amino acids, also differ greatly in size. This large variation poses a challenge, as it becomes extremely difficult to extract generalizable information from graphs of such differing scales accurately. In this work, we propose Structure-guided Sequence Representation Learning, a novel framework that incorporates structural knowledge to extract informative, multiscale features directly from protein sequences. By embedding structural information into a sequence-based learning paradigm, our method captures functionally meaningful representations more effectively. Furthermore, we present a generalizable model architecture designed for multitask learning and inference, offering improved performance and flexibility over traditional task-specific approaches to protein function prediction.</p><p><strong>Results: </strong>In this article, we demonstrate that the proposed novel attention pooling method on protein graphs effectively integrates global structural features and local chemical properties of amino acids in various-length proteins. Through this approach, we improve performance in tasks related to predicting protein functions, functional expression sites, and their relationships with structure and sequence. By effectively extracting the information needed to predict multiple protein functions simultaneously, we improve efficiency by eliminating the need for separate learning.</p><p><strong>Availability and implementation: </strong>The code implementation is available at https://github.com/vanha9/S2RL_protein and has also been archived on zenodo: https://doi.org/10.5281/zenodo.16441001.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12478692/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics (Oxford, England)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioinformatics/btaf511","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Motivation: Accurately predicting protein function from sequence remains a fundamental yet challenging goal in computational biology. Although recent advances have enabled the reliable prediction of protein 3D structures from sequences, utilizing structural information alone for functional inference has shown limited success. To address this gap, previous work has explored the integration of sequence and structural data by representing proteins as graphs, where residues are modeled as nodes, and spatial proximity defines edges. However, since the number of amino acids can vary significantly between proteins, the resulting graphs, constructed based on amino acids, also differ greatly in size. This large variation poses a challenge, as it becomes extremely difficult to extract generalizable information from graphs of such differing scales accurately. In this work, we propose Structure-guided Sequence Representation Learning, a novel framework that incorporates structural knowledge to extract informative, multiscale features directly from protein sequences. By embedding structural information into a sequence-based learning paradigm, our method captures functionally meaningful representations more effectively. Furthermore, we present a generalizable model architecture designed for multitask learning and inference, offering improved performance and flexibility over traditional task-specific approaches to protein function prediction.
Results: In this article, we demonstrate that the proposed novel attention pooling method on protein graphs effectively integrates global structural features and local chemical properties of amino acids in various-length proteins. Through this approach, we improve performance in tasks related to predicting protein functions, functional expression sites, and their relationships with structure and sequence. By effectively extracting the information needed to predict multiple protein functions simultaneously, we improve efficiency by eliminating the need for separate learning.
Availability and implementation: The code implementation is available at https://github.com/vanha9/S2RL_protein and has also been archived on zenodo: https://doi.org/10.5281/zenodo.16441001.