pSCAN: Fast and exact structural graph clustering

2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 253-264. Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498245

Lijun Chang, Wei Li, Xuemin Lin, Lu Qin, W. Zhang


Citations: 72

Abstract

In this paper, we study structural graph clustering, a fundamental problem in managing and analyzing graph data. Given a large graph G = (V, E), structural graph clustering assigns the vertices in V to clusters and also identifies the sets of hub vertices and outlier vertices, such that vertices in the same cluster are densely connected to each other while vertices in different clusters are loosely connected. First, we prove that the existing SCAN approach is worst-case optimal. Nevertheless, it does not scale to large graphs because it exhaustively computes structural similarity for every pair of adjacent vertices. Second, we make three observations about structural graph clustering that present opportunities for further optimization. Based on these observations, we develop a new two-step paradigm for scalable structural graph clustering. Third, following this paradigm, we present a new approach that aims to reduce the number of structural similarity computations. Moreover, we propose optimization techniques to speed up checking whether two vertices are structure-similar to each other. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our new approach outperforms the state-of-the-art approaches by over one order of magnitude. Notably, on the Twitter graph with 1 billion edges, our approach takes 25 minutes, while the state-of-the-art approach cannot finish even after 24 hours.
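The abstract does not restate how structural similarity is measured. As a rough illustration only, the sketch below assumes the definition commonly given for SCAN: the similarity of two adjacent vertices is the size of the intersection of their closed neighborhoods divided by the geometric mean of the neighborhood sizes, and a vertex is a core if at least mu vertices in its closed neighborhood have similarity at least eps. The names `adj`, `eps`, and `mu`, and the toy graph, are hypothetical. This is the exhaustive per-pair check that the abstract says limits SCAN, not the pSCAN algorithm itself.

```python
from math import sqrt

# Toy undirected graph (hypothetical): two triangles {0,1,2} and {3,4,5}
# joined by the single edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

def closed_neighborhood(adj, v):
    """Return N[v]: the neighbors of v together with v itself."""
    return adj[v] | {v}

def structural_similarity(adj, u, v):
    """Assumed SCAN-style similarity: |N[u] & N[v]| / sqrt(|N[u]| * |N[v]|)."""
    nu = closed_neighborhood(adj, u)
    nv = closed_neighborhood(adj, v)
    return len(nu & nv) / sqrt(len(nu) * len(nv))

def is_core(adj, u, eps, mu):
    """A vertex is treated as a core if at least mu vertices in N[u]
    (including u itself) have similarity >= eps to u."""
    similar = sum(
        1 for v in closed_neighborhood(adj, u)
        if structural_similarity(adj, u, v) >= eps
    )
    return similar >= mu

if __name__ == "__main__":
    print(structural_similarity(adj, 0, 1))  # vertices inside one triangle
    print(structural_similarity(adj, 2, 3))  # the bridge between triangles
    print(is_core(adj, 2, eps=0.7, mu=3))
```

In this toy graph the edge inside a triangle scores higher than the bridging edge 2-3, which is the intuition behind "densely connected within clusters, loosely connected across clusters". Each call to `structural_similarity` intersects two neighborhoods, and doing this for every edge is the cost that the new approach aims to reduce.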

