Using Neo4j for Mining Protein Graphs: A Case Study

2015 26th International Workshop on Database and Expert Systems Applications (DEXA). Pub Date: 2015-09-01. DOI: 10.1109/DEXA.2015.59

D. Hoksza, Jan Jelínek


Citations: 18

Abstract

Graph databases are becoming increasingly popular in domains where data can be modeled as a set of connected objects. They make it possible to query such data with graph-based queries in a relatively simple manner compared with classical relational databases. In this paper, we show how one of the most popular graph databases, Neo4j, can be applied to the bioinformatics problem of protein-protein interface (PPI) identification. The goal of the PPI identification task is, given a protein structure, to identify the amino acids responsible for binding the structure to other proteins. Each protein structure consists of a set of amino acid molecules that can be conceived as a graph, and a multitude of methods for analyzing such protein graphs have been established. We introduce a knowledge-based approach that can enhance the quality of these methods by utilizing existing protein structure knowledge stored in the Protein Data Bank (PDB). We show how to transform information about protein complexes from the PDB into Neo4j, where it can be stored as a set of independent protein graphs. The resulting graph database contains about 14 million labeled nodes and 38 million edges. In the PPI identification phase, this database is queried using exact subgraph matching, and the results are aggregated to improve an existing PPI identification method. We show the pros and cons of using Neo4j for such an endeavor with respect to the size of the database and the complexity of the queries, in comparison with a relational database (Microsoft SQL Server). We conclude that Neo4j is a viable option for specific, rather small, subgraph query types. However, we encountered performance limitations, especially for query graphs with a larger number of edges.
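To make the idea of querying the database by exact subgraph matching more concrete, the following is a minimal sketch of how a small labeled query graph could be turned into a Cypher `MATCH` statement. It is not the authors' implementation: the function name `build_match_query`, the use of amino acid type codes (`ALA`, `GLY`, `SER`) as node labels, and the relationship type `CONTACT` are all hypothetical choices made for illustration, since the abstract does not describe the actual schema. The sketch only builds the query text and does not connect to Neo4j.

```python
from typing import Dict, List, Tuple


def build_match_query(
    node_labels: Dict[str, str],
    edges: List[Tuple[str, str]],
    rel_type: str = "CONTACT",  # hypothetical relationship type, not from the paper
) -> str:
    """Build a Cypher MATCH statement for a small labeled query graph.

    node_labels maps a query variable (e.g. "a") to a node label
    (here, hypothetically, an amino acid type such as "ALA").
    edges lists undirected connections between query variables.
    """
    seen = set()

    def node(var: str) -> str:
        # Attach the label only the first time a variable is mentioned.
        if var in seen:
            return f"({var})"
        seen.add(var)
        return f"({var}:{node_labels[var]})"

    patterns = []
    for a, b in edges:
        left = node(a)
        right = node(b)
        # Undirected pattern: no arrowhead on either side.
        patterns.append(f"{left}-[:{rel_type}]-{right}")

    # Query nodes without any edge still have to appear in the pattern.
    for var in node_labels:
        if var not in seen:
            patterns.append(node(var))

    returned = ", ".join(sorted(node_labels))
    return "MATCH " + ", ".join(patterns) + " RETURN " + returned


# A triangle of three residues as the query graph.
query = build_match_query(
    {"a": "ALA", "b": "GLY", "c": "SER"},
    [("a", "b"), ("b", "c"), ("a", "c")],
)
print(query)
```

Each query edge adds one pattern to the `MATCH` clause, so the generated statement grows with the number of edges in the query graph, which is the dimension along which the abstract reports performance limitations.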

