{"title":"GOProteinGNN: Leveraging Protein Knowledge Graphs for Protein Representation Learning","authors":"Dan Kalifa, Uriel Singer, Kira Radinsky","doi":"arxiv-2408.00057","DOIUrl":null,"url":null,"abstract":"Proteins play a vital role in biological processes and are indispensable for\nliving organisms. Accurate representation of proteins is crucial, especially in\ndrug development. Recently, there has been a notable increase in interest in\nutilizing machine learning and deep learning techniques for unsupervised\nlearning of protein representations. However, these approaches often focus\nsolely on the amino acid sequence of proteins and lack factual knowledge about\nproteins and their interactions, thus limiting their performance. In this\nstudy, we present GOProteinGNN, a novel architecture that enhances protein\nlanguage models by integrating protein knowledge graph information during the\ncreation of amino acid level representations. Our approach allows for the\nintegration of information at both the individual amino acid level and the\nentire protein level, enabling a comprehensive and effective learning process\nthrough graph-based learning. By doing so, we can capture complex relationships\nand dependencies between proteins and their functional annotations, resulting\nin more robust and contextually enriched protein representations. Unlike\nprevious fusion methods, GOProteinGNN uniquely learns the entire protein\nknowledge graph during training, which allows it to capture broader relational\nnuances and dependencies beyond mere triplets as done in previous work. We\nperform a comprehensive evaluation on several downstream tasks demonstrating\nthat GOProteinGNN consistently outperforms previous methods, showcasing its\neffectiveness and establishing it as a state-of-the-art solution for protein\nrepresentation learning.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"45 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Biomolecules","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.00057","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Proteins play a vital role in biological processes and are indispensable for
living organisms. Accurate representation of proteins is crucial, especially in
drug development. Recently, there has been a notable increase in interest in
utilizing machine learning and deep learning techniques for unsupervised
learning of protein representations. However, these approaches often focus
solely on the amino acid sequence of proteins and lack factual knowledge about
proteins and their interactions, thus limiting their performance. In this
study, we present GOProteinGNN, a novel architecture that enhances protein
language models by integrating protein knowledge graph information during the
creation of amino acid level representations. Our approach allows for the
integration of information at both the individual amino acid level and the
entire protein level, enabling a comprehensive and effective learning process
through graph-based learning. By doing so, we can capture complex relationships
and dependencies between proteins and their functional annotations, resulting
in more robust and contextually enriched protein representations. Unlike
previous fusion methods, GOProteinGNN uniquely learns the entire protein
knowledge graph during training, which allows it to capture broader relational
nuances and dependencies beyond mere triplets as done in previous work. We
perform a comprehensive evaluation on several downstream tasks demonstrating
that GOProteinGNN consistently outperforms previous methods, showcasing its
effectiveness and establishing it as a state-of-the-art solution for protein
representation learning.