Distributed Incremental Graph Analysis
Upa Gupta, L. Fegaras
2016 IEEE International Congress on Big Data (BigData Congress), 2016-06-01
DOI: https://doi.org/10.1109/BigDataCongress.2016.18
Citation count: 0
Abstract
Distributed frameworks such as MapReduce and Spark have been developed by industry and research groups to analyze the vast amount of data generated daily. Many graphs of interest, such as the Web graph and social networks, grow daily at an unprecedented scale and rate. To cope with this volume of data, researchers have used distributed processing frameworks extensively to analyze these graphs. Most of these graph algorithms are iterative in nature. In our previous work, we introduced an efficient design pattern for handling a family of iterative graph algorithms in a distributed framework. Unfortunately, in most of these iterative algorithms, such as PageRank, if the graph is modified by adding or deleting edges or vertices, the result has to be recomputed from scratch. In this paper, we introduce an improved design pattern that lets such algorithms handle graph updates incrementally. Our method separates the graph topology from the graph analysis results. At each iteration step, each node participating in the graph analysis task reads a single graph partition and, in addition, all the current analysis results from the distributed file system (DFS). These results are correlated with the local graph partition using a special merge-join, and the new, improved analysis results are calculated and stored to the DFS, one partition from each worker node. To handle continuous updates, an update function collects the changes to the graph and applies them to the graph partitions in a streaming fashion. Once the changes are made, the iterative algorithm is resumed to process the updated data. Since a large part of the graph analysis task has already been completed on the existing data, the iterative algorithm converges faster and needs fewer iterations to compute the new analysis results.
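The pattern described in the abstract can be illustrated with a small single-process sketch. It is not the authors' implementation. In-memory lists stand in for DFS partitions, PageRank is the example analysis, and every name (`merge_join`, `step`, `run`, `apply_updates`), the damping factor, the tolerance, and the toy graph are assumptions made for illustration. The update function handles only edge insertions, and vertices without outgoing edges are not treated specially.

```python
from collections import defaultdict

DAMPING = 0.85  # conventional PageRank damping factor (assumed, not from the paper)


def merge_join(partition, ranks):
    """Join a sorted partition [(vertex, out_neighbors)] with sorted [(vertex, rank)]."""
    j = 0
    for vertex, neighbors in partition:
        # Advance the rank cursor until it reaches this partition's vertex.
        while j < len(ranks) and ranks[j][0] < vertex:
            j += 1
        if j < len(ranks) and ranks[j][0] == vertex:
            yield vertex, neighbors, ranks[j][1]


def step(partitions, ranks, n):
    """One iteration: each partition (one per worker) is joined with all current ranks."""
    incoming = defaultdict(float)
    rank_list = sorted(ranks.items())  # the analysis results, kept apart from topology
    for partition in partitions:
        for vertex, neighbors, rank in merge_join(partition, rank_list):
            for target in neighbors:
                incoming[target] += rank / len(neighbors)
    return {v: (1 - DAMPING) / n + DAMPING * incoming[v] for v in ranks}


def run(partitions, ranks, tol=1e-8, max_iter=1000):
    """Iterate from the given ranks until the L1 change drops below tol."""
    n = len(ranks)
    for iteration in range(1, max_iter + 1):
        new_ranks = step(partitions, ranks, n)
        delta = sum(abs(new_ranks[v] - ranks[v]) for v in ranks)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks, iteration


def apply_updates(partitions, added_edges):
    """Apply edge insertions to whichever partition owns the source vertex."""
    updated = []
    for partition in partitions:
        adj = {v: list(ns) for v, ns in partition}  # copy, leaving the input intact
        for src, dst in added_edges:
            if src in adj and dst not in adj[src]:
                adj[src].append(dst)
        updated.append(sorted(adj.items()))
    return updated


# Toy graph (hypothetical): vertex 0 links to every other vertex, all others link to 0.
n = 8
graph = {0: list(range(1, n))}
graph.update({v: [0] for v in range(1, n)})
partitions = [[(v, graph[v]) for v in range(0, 4)],
              [(v, graph[v]) for v in range(4, n)]]

uniform = {v: 1.0 / n for v in range(n)}
old_ranks, _ = run(partitions, uniform)

# A graph update arrives: add the edge 1 -> 2, then resume from the previous results.
new_partitions = apply_updates(partitions, [(1, 2)])
warm_ranks, warm_iters = run(new_partitions, old_ranks)  # incremental restart
cold_ranks, cold_iters = run(new_partitions, uniform)    # recompute from scratch
print("warm start iterations:", warm_iters, "cold start iterations:", cold_iters)
```

On this toy graph the resumed run reaches the same ranks as the from-scratch run in fewer iterations, because the single added edge moves the fixed point only slightly. How much is saved in general depends on how strongly an update perturbs the previous results.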