CoDiS: Community Detection via Distributed Seed Set Expansion on Graph Streams

Impact factor: 2.9; JCR quartile: Q3; category: Computer Science, Information Systems

Information (Switzerland), Pub Date: 2023-11-01, DOI: 10.3390/info14110594

Austin Anderson, Petros Potikas, Katerina Potika



Abstract

Community detection has been, and remains, an important topic in fields ranging from marketing and social networking to biological studies. Research on this topic originally classified nodes into discrete (non-overlapping) communities but later moved on to placing nodes in multiple (overlapping) communities. Unfortunately, community detection has always been a time-inefficient process, and datasets are now too large to process realistically with traditional methods. Because of this, recent methods have turned to parallelism and to graph stream models, where the edge list is accessed one edge at a time. However, these methods, while offering a significant decrease in processing time, still have several shortcomings. We propose a new parallel algorithm called community detection with seed sets (CoDiS), which solves the overlapping community detection problem in graph streams. Initially, some nodes (seed sets) have known community structures, and the aim is to expand these communities by processing one edge at a time. The innovation of our approach is that it splits communities among the parallel computation workers so that each worker updates only a subset of all the communities. By doing so, we reduce the edge-processing throughput required of each worker and the amount of time each worker spends on each edge. Crucially, we remove the need for every worker to have access to every community. Experimental results show a significant improvement in running time with no loss of accuracy.
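The idea of splitting communities among workers can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the membership-update rule, so the sketch assumes a simple one (a node joins a community once a fixed number of streamed edges link it to current members). The function names, the round-robin partitioning, and the `threshold` parameter are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_communities(seed_sets, num_workers):
    """Assign each community to exactly one worker (assumed: round-robin)."""
    shards = [dict() for _ in range(num_workers)]
    for i, (cid, seeds) in enumerate(sorted(seed_sets.items())):
        shards[i % num_workers][cid] = set(seeds)  # copy so inputs stay intact
    return shards


def expand_shard(communities, edges, threshold=2):
    """One pass over the edge stream, touching only this worker's communities.

    Assumed rule (not taken from the paper): a node joins a community once
    `threshold` edges have linked it to nodes that were members at the time.
    """
    links = defaultdict(int)  # (community id, node) -> edges seen into community
    for u, v in edges:
        for cid, members in communities.items():
            for inside, outside in ((u, v), (v, u)):
                if inside in members and outside not in members:
                    links[(cid, outside)] += 1
                    if links[(cid, outside)] >= threshold:
                        members.add(outside)
    return communities


def run_workers(seed_sets, edges, num_workers=2, threshold=2):
    """Run one worker per shard and merge the disjoint results."""
    shards = partition_communities(seed_sets, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(lambda s: expand_shard(s, edges, threshold), shards))
    merged = {}
    for shard in results:
        merged.update(shard)
    return merged


# Toy stream: two seed sets, six edges, read one edge at a time.
seeds = {"A": {1, 2}, "B": {5, 6}}
edges = [(1, 3), (2, 3), (3, 4), (5, 4), (6, 4), (1, 4)]
result = run_workers(seeds, edges, num_workers=2, threshold=2)
print(result)
```

In this toy run, node 4 ends up in both communities, which is the overlapping case the abstract describes. Because each worker owns a disjoint set of communities, no worker needs to read or lock another worker's community state, and the merged result does not depend on the number of workers.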
