Large-scale frequent subgraph mining in MapReduce

2014 IEEE 30th International Conference on Data Engineering. Pub Date: 2014-05-19. DOI: 10.1109/ICDE.2014.6816705

Wenqing Lin, Xiaokui Xiao, Gabriel Ghinita


Citations: 85

Abstract

Mining frequent subgraphs from a large collection of graph objects is an important problem in several application domains such as bioinformatics, social networks, and computer vision. The main challenge in subgraph mining is efficiency, as (i) testing for graph isomorphisms is computationally intensive, and (ii) the cardinality of the graph collection to be mined may be very large. We propose a two-step filter-and-refinement approach that is suitable for massive parallelization within the scalable MapReduce computing model. We partition the collection of graphs among worker nodes, and each worker applies the filter step to determine a set of candidate subgraphs that are locally frequent in its partition. The union of all such graphs is the input to the refinement step, where each candidate is checked against all partitions and only the globally frequent graphs are retained. We devise a statistical threshold mechanism that allows us to predict which subgraphs have a high chance of becoming globally frequent, and thus reduce the computational overhead in the refinement step. We also propose effective strategies to avoid redundant computation in each round when searching for candidate graphs, as well as a lightweight graph compression mechanism to reduce the communication cost between machines. Extensive experimental evaluation on several large real-world graph datasets shows that the proposed approach clearly outperforms the existing state of the art and provides a practical solution to the problem of frequent subgraph mining for massive collections of graphs.
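
The two-step filter-and-refinement idea in the abstract can be illustrated with a minimal single-process Python sketch. It is not the authors' implementation, and it rests on several assumptions: each graph is modeled as a set of labeled edges, a "pattern" is a single labeled edge (so containment is a set-membership test instead of a subgraph isomorphism test), and the filter step reuses the global relative support as the local threshold instead of the paper's statistical threshold mechanism. The function names `local_candidates`, `count_in_partition`, and `mine` are hypothetical.

```python
from collections import Counter

# Toy model (an assumption, not the paper's): a graph is a set of labeled
# edges and a pattern is one labeled edge, so no isomorphism test is needed.


def local_candidates(partition, min_ratio):
    """Filter step: patterns that are frequent within a single partition."""
    counts = Counter(edge for graph in partition for edge in graph)
    # Same relative threshold as the global one (assumed, simpler than the
    # statistical threshold the abstract mentions).
    return {p for p, c in counts.items() if c >= min_ratio * len(partition)}


def count_in_partition(partition, candidates):
    """Refinement step: exact local support of every candidate pattern."""
    return Counter(p for graph in partition for p in candidates if p in graph)


def mine(partitions, min_ratio):
    """Return {pattern: global support} for globally frequent patterns."""
    # Filter: union of the locally frequent patterns of all partitions.
    candidates = set().union(
        *(local_candidates(part, min_ratio) for part in partitions)
    )
    # Refinement: check each candidate against all partitions and sum.
    total = Counter()
    for part in partitions:
        total.update(count_in_partition(part, candidates))
    n_graphs = sum(len(part) for part in partitions)
    return {p: c for p, c in total.items() if c >= min_ratio * n_graphs}


# Two partitions of two graphs each.
partitions = [
    [{("A", "B"), ("B", "C")}, {("A", "B")}],
    [{("A", "B"), ("C", "D")}, {("C", "D")}],
]
result = mine(partitions, min_ratio=0.5)
print(result)
```

With the same relative threshold in both steps, no globally frequent pattern is lost in the filter step: a pattern whose support is below the threshold in every partition cannot reach the threshold over the whole collection. In the example, the edge `("B", "C")` passes the filter in the first partition but is discarded during refinement because it occurs in only one of the four graphs.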

