Managing large dynamic graphs efficiently

Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data Pub Date : 2012-05-20 DOI:10.1145/2213836.2213854

J. Mondal, A. Deshpande

{"title":"Managing large dynamic graphs efficiently","authors":"J. Mondal, A. Deshpande","doi":"10.1145/2213836.2213854","DOIUrl":null,"url":null,"abstract":"There is an increasing need to ingest, manage, and query large volumes of graph-structured data arising in applications like social networks, communication networks, biological networks, and so on. Graph databases that can explicitly reason about the graphical nature of the data, that can support flexible schemas and node-centric or edge-centric analysis and querying, are ideal for storing such data. However, although there is much work on single-site graph databases and on efficiently executing different types of queries over large graphs, to date there is little work on understanding the challenges in distributed graph databases, needed to handle the large scale of such data. In this paper, we propose the design of an in-memory, distributed graph data management system aimed at managing a large-scale dynamically changing graph, and supporting low-latency query processing over it. The key challenge in a distributed graph database is that, partitioning a graph across a set of machines inherently results in a large number of distributed traversals across partitions to answer even simple queries. We propose aggressive replication of the nodes in the graph for supporting low-latency querying, and investigate three novel techniques to minimize the communication bandwidth and the storage requirements. First, we develop a hybrid replication policy that monitors node read-write frequencies to dynamically decide what data to replicate, and whether to do eager or lazy replication. Second, we propose a clustering-based approach to amortize the costs of making these replication decisions. Finally, we propose using a fairness criterion to dictate how replication decisions should be made. We provide both theoretical analysis and efficient algorithms for the optimization problems that arise. We have implemented our framework as a middleware on top of the open-source CouchDB key-value store. We evaluate our system on a social graph, and show that our system is able to handle very large graphs efficiently, and that it reduces the network bandwidth consumption significantly.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"51 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"114","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2213836.2213854","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 114

Abstract

There is an increasing need to ingest, manage, and query large volumes of graph-structured data arising in applications like social networks, communication networks, biological networks, and so on. Graph databases that can explicitly reason about the graphical nature of the data, that can support flexible schemas and node-centric or edge-centric analysis and querying, are ideal for storing such data. However, although there is much work on single-site graph databases and on efficiently executing different types of queries over large graphs, to date there is little work on understanding the challenges in distributed graph databases, needed to handle the large scale of such data. In this paper, we propose the design of an in-memory, distributed graph data management system aimed at managing a large-scale dynamically changing graph, and supporting low-latency query processing over it. The key challenge in a distributed graph database is that, partitioning a graph across a set of machines inherently results in a large number of distributed traversals across partitions to answer even simple queries. We propose aggressive replication of the nodes in the graph for supporting low-latency querying, and investigate three novel techniques to minimize the communication bandwidth and the storage requirements. First, we develop a hybrid replication policy that monitors node read-write frequencies to dynamically decide what data to replicate, and whether to do eager or lazy replication. Second, we propose a clustering-based approach to amortize the costs of making these replication decisions. Finally, we propose using a fairness criterion to dictate how replication decisions should be made. We provide both theoretical analysis and efficient algorithms for the optimization problems that arise. We have implemented our framework as a middleware on top of the open-source CouchDB key-value store. We evaluate our system on a social graph, and show that our system is able to handle very large graphs efficiently, and that it reduces the network bandwidth consumption significantly.

查看原文本刊更多论文

有效地管理大型动态图形

在诸如社交网络、通信网络、生物网络等应用程序中，越来越需要摄取、管理和查询大量的图结构数据。图形数据库可以显式地推断数据的图形性质，支持灵活的模式以及以节点为中心或以边缘为中心的分析和查询，是存储此类数据的理想选择。然而，尽管在单站点图数据库和有效地执行大型图上不同类型的查询方面有很多工作，但迄今为止，在理解分布式图数据库中处理大规模此类数据所需的挑战方面的工作很少。在本文中，我们提出了一个内存中的分布式图形数据管理系统的设计，旨在管理大规模动态变化的图形，并支持对其进行低延迟查询处理。分布式图数据库面临的主要挑战是，在一组机器上对图进行分区必然会导致大量跨分区的分布式遍历，即使是为了回答简单的查询。为了支持低延迟查询，我们提出了图中节点的主动复制，并研究了三种新技术来最小化通信带宽和存储需求。首先，我们开发了一个混合复制策略，该策略监视节点读写频率，以动态地决定要复制哪些数据，以及是进行渴望复制还是惰性复制。其次，我们提出了一种基于集群的方法来分摊做出这些复制决策的成本。最后，我们建议使用公平标准来规定如何做出复制决策。我们为出现的优化问题提供理论分析和有效的算法。我们已经将框架实现为开源CouchDB键值存储之上的中间件。我们在社交图上评估了我们的系统，并表明我们的系统能够有效地处理非常大的图，并且它显著降低了网络带宽消耗。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

自引率

0.00%

发文量