Distributed Set Reachability

Proceedings of the 2016 International Conference on Management of Data. Pub Date: 2016-06-14. DOI: 10.1145/2882903.2915226

Sairam Gurajada, M. Theobald


Citations: 14

Abstract

In this paper, we focus on the efficient and scalable processing of set-reachability queries over a distributed, directed data graph. A "set-reachability query" is a generalized form of a reachability query in which two sets S and T of source and target vertices, respectively, are given as the query. The result of a set-reachability query is the set of all pairs of source and target vertices (s, t), with s ∈ S and t ∈ T, where t is reachable from s (denoted as S ↝ T). In case the data graph is partitioned into multiple, edge- and vertex-disjoint subgraphs (e.g., when distributed across multiple compute nodes in a cluster), we refer to the resulting set-reachability problem as "distributed set reachability". The key goals in processing a distributed set-reachability query over a partitioned data graph both efficiently and in a scalable manner are (1) to avoid redundant computations within the local compute nodes as much as possible, (2) to partially evaluate the local components of a set-reachability query S ↝ T among all compute nodes in parallel, and (3) to minimize both the size and number of messages exchanged among the compute nodes. Distributed set reachability has a plethora of applications in graph analytics and query processing. The current W3C recommendation for SPARQL 1.1, for example, introduces a notion of "labeled property paths" that resolves to processing a form of generalized graph-pattern queries with set-reachability predicates. Moreover, analyzing dependencies among "social-network communities" inherently involves reachability checks between large sets of source and target vertices. Our experiments confirm significant performance gains of our approach in comparison to state-of-the-art graph engines such as Giraph++, over a variety of graph collections with up to 1.4 billion edges.
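To make the query definition concrete, the following is a minimal, single-machine sketch of what a set-reachability query S ↝ T returns. It is not the distributed algorithm of the paper; it is a naive baseline that runs one breadth-first search per source vertex. The function name `set_reachability`, the edge-list input format, and the convention that a vertex reaches itself via a path of length zero are all assumptions made for illustration.

```python
from collections import deque


def set_reachability(edges, sources, targets):
    """Return all pairs (s, t), s in sources and t in targets,
    such that t is reachable from s in the directed graph.

    Assumption: a vertex is considered reachable from itself.
    """
    # Build an adjacency list from the directed edge list.
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)

    targets = set(targets)
    result = set()
    for s in sources:
        # Plain BFS from s; every target visited yields one result pair.
        visited = {s}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            if u in targets:
                result.add((s, u))
            for v in adjacency.get(u, ()):
                if v not in visited:
                    visited.add(v)
                    queue.append(v)
    return result


# Toy graph: a -> b -> c -> e and d -> c
edges = [("a", "b"), ("b", "c"), ("d", "c"), ("c", "e")]
pairs = set_reachability(edges, sources={"a", "d"}, targets={"b", "e"})
print(sorted(pairs))  # [('a', 'b'), ('a', 'e'), ('d', 'e')]
```

This baseline repeats work whenever several sources reach the same region of the graph, which is the kind of redundant computation that goal (1) in the abstract aims to avoid. It also assumes the whole graph fits on one machine, so it says nothing about goals (2) and (3).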
