Distributed Incremental Graph Analysis

2016 IEEE International Congress on Big Data (BigData Congress). Pub Date: 2016-06-01. DOI: 10.1109/BigDataCongress.2016.18

Upa Gupta, L. Fegaras

Abstract

Distributed frameworks such as MapReduce and Spark have been developed by industry and research groups to analyze the vast amounts of data generated daily. Many graphs of interest, such as the Web graph and social networks, grow daily at an unprecedented scale and rate. To cope with this volume of data, researchers have used distributed processing frameworks extensively to analyze these graphs. Most of these graph algorithms are iterative in nature. In our previous work, we introduced an efficient design pattern to handle a family of iterative graph algorithms in a distributed framework. Unfortunately, in most of these iterative algorithms, such as Page-Rank, if the graph is modified by the addition or deletion of edges or vertices, the result has to be recomputed from scratch. In this paper, we introduce an improved design pattern that lets such algorithms handle graph updates incrementally. Our method separates the graph topology from the graph analysis results. At each iteration step, each node participating in the graph analysis task reads a single graph partition and, in addition, all the current analysis results from the distributed file system (DFS). These results are correlated with the local graph partition using a special merge-join, and the new, improved analysis results are calculated and stored to the DFS, one partition from each worker node. To handle continuous updates, an update function collects the changes to the graph and applies them to the graph partitions in a streaming fashion. Once the changes are made, the iterative algorithm is resumed to process the updated data. Because a large part of the graph analysis task has already been completed on the existing data, the iterative algorithm converges faster and requires fewer iterations to compute the new graph analysis results.
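
The pattern described in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation: the partitioning rule (`v % k`), the in-memory dictionaries standing in for DFS files, the plain dictionary lookup standing in for the "special merge-join", and every function name below are assumptions made for illustration. The sketch keeps the topology (`partitions`) separate from the analysis results (`ranks`), applies an edge update to the topology, and then resumes the Page-Rank iteration from the previous results rather than from a uniform start.

```python
DAMPING = 0.85  # conventional Page-Rank damping factor (assumed, not from the paper)


def apply_updates(partitions, ranks, added=(), removed=()):
    """Apply edge changes to the topology in place and return seed ranks.

    Vertices are integers; vertex v is owned by partition v % k (hypothetical rule).
    Old ranks are carried over, new vertices get a uniform share, and the
    result is renormalised so the ranks sum to 1.
    """
    k = len(partitions)
    for u, v in added:
        for x in (u, v):
            partitions[x % k].setdefault(x, [])
        if v not in partitions[u % k][u]:
            partitions[u % k][u].append(v)
    for u, v in removed:
        outs = partitions[u % k].get(u, [])
        if v in outs:
            outs.remove(v)
    vertices = [x for part in partitions for x in part]
    n = len(vertices)
    seeded = {x: ranks.get(x, 1.0 / n) for x in vertices}
    total = sum(seeded.values())
    return {x: r / total for x, r in seeded.items()}


def pagerank_step(partitions, ranks, damping=DAMPING):
    """One iteration: each partition is joined with the full set of current ranks."""
    n = len(ranks)
    incoming = dict.fromkeys(ranks, 0.0)
    dangling = 0.0
    for part in partitions:           # each pass stands in for one worker node
        for v, outs in part.items():  # lookup stands in for the merge-join
            if outs:
                share = ranks[v] / len(outs)
                for w in outs:
                    incoming[w] += share
            else:
                dangling += ranks[v]  # vertices without out-edges spread rank uniformly
    return {v: (1 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in ranks}


def iterate(partitions, ranks, tol=1e-10, max_iters=1000):
    """Repeat pagerank_step until the L1 change drops below tol."""
    for i in range(1, max_iters + 1):
        new = pagerank_step(partitions, ranks)
        delta = sum(abs(new[v] - ranks[v]) for v in ranks)
        ranks = new
        if delta < tol:
            return ranks, i
    return ranks, max_iters


# Initial analysis on a small example graph split into two partitions.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0), (3, 4), (4, 0)]
parts = [dict() for _ in range(2)]
initial = apply_updates(parts, {}, added=edges)
ranks, first_iters = iterate(parts, initial)

# Graph update: add one edge, then resume from the previous results.
warm_seed = apply_updates(parts, ranks, added=[(4, 2)])
warm, warm_iters = iterate(parts, warm_seed)

# Baseline: recompute from scratch on the same updated topology.
cold_seed = {v: 1.0 / len(warm_seed) for v in warm_seed}
cold, cold_iters = iterate(parts, cold_seed)

print("resumed iterations:", warm_iters, "from-scratch iterations:", cold_iters)
```

Both runs converge to the same ranks on the updated graph; the printed counts show how many iterations each start needed on this toy input. How much a resumed run saves in practice depends on how far the update moves the solution, which this sketch does not attempt to characterise.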

