NScaleSpark: subgraph-centric graph analytics on Apache Spark

A. Quamar, A. Deshpande
{"title":"NScaleSpark: subgraph-centric graph analytics on Apache Spark","authors":"A. Quamar, A. Deshpande","doi":"10.1145/2980523.2980529","DOIUrl":null,"url":null,"abstract":"In this paper, we describe NScaleSpark, a framework for executing large-scale distributed graph analysis tasks on the Apache Spark platform. NScaleSpark is motivated by the increasing interest in executing rich and complex analysis tasks over large graph datasets. There is much recent work on vertex-centric graph programming frameworks for executing such analysis tasks -- these systems espouse a \"think-like-a-vertex\" (TLV) paradigm, with some example systems being Pregel, Apache Giraph, GPS, Grace, and GraphX (built on top of Apache Spark). However, the TLV paradigm is not suitable for many complex graph analysis tasks that typically require processing of information aggregated over neighborhoods or subgraphs in the underlying graph. Instead, NScaleSpark is based on a \"think-like-a-subgraph\" paradigm (also recently called \"think-like-an-embedding\" [23]). Here, the users specify computations to be executed against a large number of multi-hop neighborhoods or subgraphs of the data graph. NScaleSpark builds upon our prior work on the NScale system [18], which was built on top of the Hadoop MapReduce system. We describe how we reimplemented NScale on the Apache Spark platform, the key challenges therein, and the design decisions we made. NScaleSpark uses a series of RDD transformations to extract and hold the relevant subgraphs in distributed memory with minimal footprint using a cost-based optimizer. Our in-memory graph data structure enables efficient graph computations over large-scale graphs. Our experimental results over several real world data sets and applications show orders-of-magnitude improvement in performance and total cost over GraphX and other vertex-centric approaches.","PeriodicalId":246127,"journal":{"name":"Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2980523.2980529","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

In this paper, we describe NScaleSpark, a framework for executing large-scale distributed graph analysis tasks on the Apache Spark platform. NScaleSpark is motivated by the increasing interest in executing rich and complex analysis tasks over large graph datasets. There is much recent work on vertex-centric graph programming frameworks for executing such analysis tasks -- these systems espouse a "think-like-a-vertex" (TLV) paradigm, with some example systems being Pregel, Apache Giraph, GPS, Grace, and GraphX (built on top of Apache Spark). However, the TLV paradigm is not suitable for many complex graph analysis tasks that typically require processing of information aggregated over neighborhoods or subgraphs in the underlying graph. Instead, NScaleSpark is based on a "think-like-a-subgraph" paradigm (also recently called "think-like-an-embedding" [23]). Here, the users specify computations to be executed against a large number of multi-hop neighborhoods or subgraphs of the data graph. NScaleSpark builds upon our prior work on the NScale system [18], which was built on top of the Hadoop MapReduce system. We describe how we reimplemented NScale on the Apache Spark platform, the key challenges therein, and the design decisions we made. NScaleSpark uses a series of RDD transformations to extract and hold the relevant subgraphs in distributed memory with minimal footprint using a cost-based optimizer. Our in-memory graph data structure enables efficient graph computations over large-scale graphs. Our experimental results over several real world data sets and applications show orders-of-magnitude improvement in performance and total cost over GraphX and other vertex-centric approaches.
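The abstract references a "think-like-a-subgraph" programming model in which users supply a computation to run against each multi-hop neighborhood, materialized via a series of RDD transformations. The paper text here contains no code, so the following is a minimal, hypothetical sketch in Scala using plain Spark RDDs, not NScaleSpark's actual API: it extracts each vertex's 1-hop neighborhood subgraph with self-joins on a symmetrized edge list and then applies a user-supplied per-subgraph function. The object name SubgraphSketch and the function analyze are illustrative inventions.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of the "think-like-a-subgraph" idea on plain Spark RDDs;
// NScaleSpark's real API and its cost-based subgraph packing are not shown here.
object SubgraphSketch {

  // User-defined computation over one extracted subgraph: count edges that
  // connect two neighbors of the center vertex. Each undirected edge appears
  // twice because the edge list is symmetrized in main() below.
  def analyze(center: Long, neighbors: Set[Long], edges: Set[(Long, Long)]): (Long, Int) =
    (center, edges.count { case (u, v) => neighbors(u) && neighbors(v) })

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SubgraphSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Toy undirected graph, stored with both edge directions.
    val edges: RDD[(Long, Long)] = sc
      .parallelize(Seq((1L, 2L), (2L, 3L), (1L, 3L), (3L, 4L)))
      .flatMap { case (u, v) => Seq((u, v), (v, u)) }

    // Adjacency lists: vertex -> 1-hop neighbor set.
    val adj: RDD[(Long, Set[Long])] = edges.groupByKey().mapValues(_.toSet)

    // Route every edge (a, b) to each center vertex whose neighborhood
    // contains both endpoints, i.e. to every c in N(a) intersect N(b).
    val routed: RDD[(Long, (Long, Long))] = edges
      .join(adj)                                    // (a, (b, N(a)))
      .map { case (a, (b, na)) => (b, (a, na)) }
      .join(adj)                                    // (b, ((a, N(a)), N(b)))
      .flatMap { case (b, ((a, na), nb)) => (na intersect nb).map(c => (c, (a, b))) }

    // Assemble each center's neighborhood subgraph and run the user function;
    // leftOuterJoin keeps centers whose neighborhoods contain no internal edges.
    val results = adj
      .leftOuterJoin(routed.groupByKey().mapValues(_.toSet))
      .map { case (c, (nbrs, es)) => analyze(c, nbrs, es.getOrElse(Set.empty)) }

    results.collect().foreach { case (c, n) =>
      println(s"vertex $c: $n directed edges among its neighbors")
    }
    spark.stop()
  }
}

Per the abstract, the real system drives this extraction step with a cost-based optimizer that holds the relevant subgraphs in distributed memory with minimal footprint; the naive self-join above conveys only the shape of the programming model, not that optimization.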