GOProteinGNN: Leveraging Protein Knowledge Graphs for Protein Representation Learning

arXiv - QuanBio - Biomolecules | Pub Date: 2024-07-31 | DOI: arxiv-2408.00057

Dan Kalifa, Uriel Singer, Kira Radinsky



Abstract

Proteins play a vital role in biological processes and are indispensable for living organisms. Accurate representation of proteins is crucial, especially in drug development. Recently, there has been a notable increase in interest in utilizing machine learning and deep learning techniques for unsupervised learning of protein representations. However, these approaches often focus solely on the amino acid sequence of proteins and lack factual knowledge about proteins and their interactions, thus limiting their performance. In this study, we present GOProteinGNN, a novel architecture that enhances protein language models by integrating protein knowledge graph information during the creation of amino acid level representations. Our approach allows for the integration of information at both the individual amino acid level and the entire protein level, enabling a comprehensive and effective learning process through graph-based learning. By doing so, we can capture complex relationships and dependencies between proteins and their functional annotations, resulting in more robust and contextually enriched protein representations. Unlike previous fusion methods, GOProteinGNN uniquely learns the entire protein knowledge graph during training, which allows it to capture broader relational nuances and dependencies beyond mere triplets as done in previous work. We perform a comprehensive evaluation on several downstream tasks demonstrating that GOProteinGNN consistently outperforms previous methods, showcasing its effectiveness and establishing it as a state-of-the-art solution for protein representation learning.
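The abstract describes the architecture only at a high level, so the following is a toy sketch of the general idea rather than the GOProteinGNN implementation. It assumes a mean-aggregation message-passing step over a small knowledge graph, followed by adding the resulting protein-level vector to each amino-acid-level vector. All names, relation labels, vector sizes, and the update rule are hypothetical and chosen only for illustration.

```python
# Toy sketch only: names, sizes, and the update rule are illustrative
# assumptions, not the method described in the paper.
from collections import defaultdict


def build_neighbors(triples):
    """Turn (head, relation, tail) triples into an undirected neighbor map."""
    neighbors = defaultdict(set)
    for head, _relation, tail in triples:
        neighbors[head].add(tail)
        neighbors[tail].add(head)
    return neighbors


def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def message_pass(embeddings, neighbors):
    """One round of mean aggregation: each node averages itself with its neighbors."""
    updated = {}
    for node, vec in embeddings.items():
        neighbor_vecs = [embeddings[n] for n in sorted(neighbors[node]) if n in embeddings]
        updated[node] = mean([vec] + neighbor_vecs)
    return updated


def inject_protein_context(residue_vecs, protein_vec):
    """Add the graph-informed protein-level vector to every amino-acid-level vector."""
    return [[r + p for r, p in zip(residue, protein_vec)] for residue in residue_vecs]


# Hypothetical mini knowledge graph: one protein linked to two placeholder GO terms.
triples = [
    ("protein_A", "enables", "GO:term_1"),
    ("protein_A", "involved_in", "GO:term_2"),
    ("GO:term_1", "is_a", "GO:term_2"),
]
embeddings = {
    "protein_A": [0.0, 0.0],
    "GO:term_1": [3.0, 0.0],
    "GO:term_2": [0.0, 3.0],
}
residues = [[1.0, 1.0], [2.0, 2.0]]  # two amino-acid-level vectors

neighbors = build_neighbors(triples)
graph_embeddings = message_pass(embeddings, neighbors)
enriched = inject_protein_context(residues, graph_embeddings["protein_A"])
print(enriched)
```

Because the update reads from the whole neighbor map rather than from one triplet at a time, the protein vector reflects every annotation connected to it, which is the contrast with triplet-only fusion that the abstract draws.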
