{"title":"Distributed Set Reachability","authors":"Sairam Gurajada, M. Theobald","doi":"10.1145/2882903.2915226","DOIUrl":null,"url":null,"abstract":"In this paper, we focus on the efficient and scalable processing of set-reachability queries over a distributed, directed data graph. A \"set-reachability query\" is a generalized form of a reachability query, in which we consider two sets S and T of source and target vertices, respectively, to be given as the query. The result of a set-reachability query are all pairs of source and target vertices (s, t), with s -- S and t #8712; T, where s is reachable to t (denoted as S ↝ T). In case the data graph is partitioned into multiple, edge- and vertex-disjoint subgraphs (e.g., when distributed across multiple compute nodes in a cluster), we refer to the resulting set-reachability problem as \"distributed set reachability\". The key goal in processing a distributed set-reachability query over a partitioned data graph both efficiently and in a scalable manner is (1) to avoid redundant computations within the local compute nodes as much as possible, (2) to partially evaluate the local components of a set-reachability query S ↝ T among all compute nodes in parallel, and (3) to minimize both the size and number of messages exchanged among the compute nodes. Distributed set reachability has a plethora of applications in graph analytics and for query processing. The current W3C recommendation for SPARQL 1.1, for example, introduces a notion of \"labeled property paths\" which resolves to processing a form of generalized graph-pattern queries with set-reachability predicates. Moreover, analyzing dependencies among \"social-network communities\" inherently involves reachability checks between large sets of source and target vertices. Our experiments confirm very significant performance gains of our approach in comparison to state-of-the-art graph engines such as Giraph++, and over a variety of graph collections with up to 1.4 billion edges.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"28 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2016 International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2882903.2915226","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14
Abstract
In this paper, we focus on the efficient and scalable processing of set-reachability queries over a distributed, directed data graph. A "set-reachability query" is a generalized form of a reachability query, in which we consider two sets S and T of source and target vertices, respectively, to be given as the query. The result of a set-reachability query are all pairs of source and target vertices (s, t), with s -- S and t #8712; T, where s is reachable to t (denoted as S ↝ T). In case the data graph is partitioned into multiple, edge- and vertex-disjoint subgraphs (e.g., when distributed across multiple compute nodes in a cluster), we refer to the resulting set-reachability problem as "distributed set reachability". The key goal in processing a distributed set-reachability query over a partitioned data graph both efficiently and in a scalable manner is (1) to avoid redundant computations within the local compute nodes as much as possible, (2) to partially evaluate the local components of a set-reachability query S ↝ T among all compute nodes in parallel, and (3) to minimize both the size and number of messages exchanged among the compute nodes. Distributed set reachability has a plethora of applications in graph analytics and for query processing. The current W3C recommendation for SPARQL 1.1, for example, introduces a notion of "labeled property paths" which resolves to processing a form of generalized graph-pattern queries with set-reachability predicates. Moreover, analyzing dependencies among "social-network communities" inherently involves reachability checks between large sets of source and target vertices. Our experiments confirm very significant performance gains of our approach in comparison to state-of-the-art graph engines such as Giraph++, and over a variety of graph collections with up to 1.4 billion edges.