NScaleSpark: subgraph-centric graph analytics on Apache Spark

A. Quamar, A. Deshpande
{"title":"NScaleSpark: subgraph-centric graph analytics on Apache Spark","authors":"A. Quamar, A. Deshpande","doi":"10.1145/2980523.2980529","DOIUrl":null,"url":null,"abstract":"In this paper, we describe NScaleSpark, a framework for executing large-scale distributed graph analysis tasks on the Apache Spark platform. NScaleSpark is motivated by the increasing interest in executing rich and complex analysis tasks over large graph datasets. There is much recent work on vertex-centric graph programming frameworks for executing such analysis tasks -- these systems espouse a \"think-like-a-vertex\" (TLV) paradigm, with some example systems being Pregel, Apache Giraph, GPS, Grace, and GraphX (built on top of Apache Spark). However, the TLV paradigm is not suitable for many complex graph analysis tasks that typically require processing of information aggregated over neighborhoods or subgraphs in the underlying graph. Instead, NScaleSpark is based on a \"think-like-a-subgraph\" paradigm (also recently called \"think-like-an-embedding\" [23]). Here, the users specify computations to be executed against a large number of multi-hop neighborhoods or subgraphs of the data graph. NScaleSpark builds upon our prior work on the NScale system [18], which was built on top of the Hadoop MapReduce system. We describe how we reimplemented NScale on the Apache Spark platform, the key challenges therein, and the design decisions we made. NScaleSpark uses a series of RDD transformations to extract and hold the relevant subgraphs in distributed memory with minimal footprint using a cost-based optimizer. Our in-memory graph data structure enables efficient graph computations over large-scale graphs. Our experimental results over several real world data sets and applications show orders-of-magnitude improvement in performance and total cost over GraphX and other vertex-centric approaches.","PeriodicalId":246127,"journal":{"name":"Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2980523.2980529","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

In this paper, we describe NScaleSpark, a framework for executing large-scale distributed graph analysis tasks on the Apache Spark platform. NScaleSpark is motivated by the increasing interest in executing rich and complex analysis tasks over large graph datasets. There is much recent work on vertex-centric graph programming frameworks for executing such analysis tasks -- these systems espouse a "think-like-a-vertex" (TLV) paradigm, with some example systems being Pregel, Apache Giraph, GPS, Grace, and GraphX (built on top of Apache Spark). However, the TLV paradigm is not suitable for many complex graph analysis tasks that typically require processing of information aggregated over neighborhoods or subgraphs in the underlying graph. Instead, NScaleSpark is based on a "think-like-a-subgraph" paradigm (also recently called "think-like-an-embedding" [23]). Here, the users specify computations to be executed against a large number of multi-hop neighborhoods or subgraphs of the data graph. NScaleSpark builds upon our prior work on the NScale system [18], which was built on top of the Hadoop MapReduce system. We describe how we reimplemented NScale on the Apache Spark platform, the key challenges therein, and the design decisions we made. NScaleSpark uses a series of RDD transformations to extract and hold the relevant subgraphs in distributed memory with minimal footprint using a cost-based optimizer. Our in-memory graph data structure enables efficient graph computations over large-scale graphs. Our experimental results over several real world data sets and applications show orders-of-magnitude improvement in performance and total cost over GraphX and other vertex-centric approaches.
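The abstract references a "think-like-a-subgraph" programming model in which users supply a computation to run against each multi-hop neighborhood, materialized via a series of RDD transformations. The paper text here contains no code, so the following is a minimal, hypothetical sketch in Scala using plain Spark RDDs, not NScaleSpark's actual API: it extracts each vertex's 1-hop neighborhood subgraph with self-joins on a symmetrized edge list and then applies a user-supplied per-subgraph function. The object name SubgraphSketch and the function analyze are illustrative inventions.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of the "think-like-a-subgraph" idea on plain Spark RDDs;
// NScaleSpark's real API and its cost-based subgraph packing are not shown here.
object SubgraphSketch {

  // User-defined computation over one extracted subgraph: count edges that
  // connect two neighbors of the center vertex. Each undirected edge appears
  // twice because the edge list is symmetrized in main() below.
  def analyze(center: Long, neighbors: Set[Long], edges: Set[(Long, Long)]): (Long, Int) =
    (center, edges.count { case (u, v) => neighbors(u) && neighbors(v) })

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SubgraphSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Toy undirected graph, stored with both edge directions.
    val edges: RDD[(Long, Long)] = sc
      .parallelize(Seq((1L, 2L), (2L, 3L), (1L, 3L), (3L, 4L)))
      .flatMap { case (u, v) => Seq((u, v), (v, u)) }

    // Adjacency lists: vertex -> 1-hop neighbor set.
    val adj: RDD[(Long, Set[Long])] = edges.groupByKey().mapValues(_.toSet)

    // Route every edge (a, b) to each center vertex whose neighborhood
    // contains both endpoints, i.e. to every c in N(a) intersect N(b).
    val routed: RDD[(Long, (Long, Long))] = edges
      .join(adj)                                    // (a, (b, N(a)))
      .map { case (a, (b, na)) => (b, (a, na)) }
      .join(adj)                                    // (b, ((a, N(a)), N(b)))
      .flatMap { case (b, ((a, na), nb)) => (na intersect nb).map(c => (c, (a, b))) }

    // Assemble each center's neighborhood subgraph and run the user function;
    // leftOuterJoin keeps centers whose neighborhoods contain no internal edges.
    val results = adj
      .leftOuterJoin(routed.groupByKey().mapValues(_.toSet))
      .map { case (c, (nbrs, es)) => analyze(c, nbrs, es.getOrElse(Set.empty)) }

    results.collect().foreach { case (c, n) =>
      println(s"vertex $c: $n directed edges among its neighbors")
    }
    spark.stop()
  }
}

Per the abstract, the real system drives this extraction step with a cost-based optimizer that holds the relevant subgraphs in distributed memory with minimal footprint; the naive self-join above conveys only the shape of the programming model, not that optimization.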