Scalable big graph processing in MapReduce

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. Pub Date: 2014-06-18. DOI: 10.1145/2588555.2593661

Lu Qin, J. Yu, Lijun Chang, Hong Cheng, Chengqi Zhang, Xuemin Lin


Citations: 79

Abstract

MapReduce has become one of the most popular parallel computing paradigms in the cloud, owing to the high scalability, reliability, and fault tolerance it achieves across a wide variety of big data processing applications. In the literature, the MapReduce Class (MRC) and the Minimal MapReduce Class (MMC) define the memory consumption, communication cost, CPU cost, and number of MapReduce rounds for an algorithm to execute in MapReduce. However, neither of them is designed for big graph processing in MapReduce: the constraints in MMC can hardly be achieved simultaneously on graphs, and the conditions in MRC may induce scalability problems when processing big graph data. In this paper, we study scalable big graph processing in MapReduce. We introduce a Scalable Graph processing Class (SGC) by relaxing some constraints in MMC to make it suitable for scalable graph processing. We define two graph join operators in SGC, namely EN join and NE join, with which a wide range of graph algorithms can be designed, including PageRank, breadth-first search, graph keyword search, Connected Component (CC) computation, and Minimum Spanning Forest (MSF) computation. Remarkably, to the best of our knowledge, for the two fundamental graph problems of CC and MSF computation, this is the first work that achieves $O(\log n)$ MapReduce rounds with $O(n+m)$ total communication cost in each round and constant memory consumption on each machine, where $n$ and $m$ are the numbers of nodes and edges in the graph, respectively. We conducted extensive performance studies using two web-scale graphs, Twitter and Friendster, which have different graph characteristics. The experimental results demonstrate that our algorithms achieve high scalability in big graph processing.
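The abstract names PageRank among the algorithms that can be expressed in this setting but does not define the EN join or NE join operators. The following is therefore only a minimal, single-process sketch of how one PageRank iteration can be phrased as a map step followed by a reduce step; it is not the paper's operators or implementation. The function names, the toy graph, and the damping factor of 0.85 are assumptions made for illustration, and the sketch assumes every node has at least one outgoing edge.

```python
from collections import defaultdict


def map_phase(adjacency, rank):
    """Map step: each node splits its current rank evenly over its out-edges."""
    for node, neighbors in adjacency.items():
        share = rank[node] / len(neighbors)  # assumes no dangling nodes
        for target in neighbors:
            yield target, share


def reduce_phase(pairs, nodes, damping=0.85):
    """Reduce step: sum the contributions per target node, then apply the update."""
    incoming = defaultdict(float)
    for target, contribution in pairs:
        incoming[target] += contribution
    n = len(nodes)
    return {v: (1 - damping) / n + damping * incoming[v] for v in nodes}


def pagerank_round(adjacency, rank, damping=0.85):
    """One simulated MapReduce round of PageRank (hypothetical helper)."""
    return reduce_phase(map_phase(adjacency, rank), list(adjacency), damping)


# Hypothetical toy graph as adjacency lists; every node has an out-edge.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {v: 1 / len(toy_graph) for v in toy_graph}
for _ in range(20):
    ranks = pagerank_round(toy_graph, ranks)
```

In this sketch the map step emits exactly one pair per edge, so the data passed between the two steps in a round grows linearly with $m$. A real MapReduce job would distribute both steps across machines rather than run them in one process.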


