Graph sub-sampling for divide-and-conquer algorithms in large networks

Eric Yanchenko
arXiv - STAT - Computation, published 2024-09-11. DOI: https://doi.org/arxiv-2409.06994

Abstract

As networks continue to grow, methods must be able to handle large numbers of nodes and edges to remain practically relevant. Analyzing sub-networks, rather than working directly with the entire (large) network, has become a popular approach. Because of a network's inherent inter-connectedness, however, sub-sampling is not a trivial task. While this problem has drawn interest in recent years, it has not received sufficient attention from the statistics community. In this work, we provide a thorough comparison of seven graph sub-sampling algorithms by applying them to divide-and-conquer algorithms for community structure and core-periphery (CP) structure. After discussing the various algorithms and sub-sampling routines, we derive theoretical results for the mis-classification rate of the divide-and-conquer algorithm for CP structure under various sub-sampling schemes. We then perform extensive experiments on both simulated and real-world data to compare the various methods. For the community detection task, we found that sampling nodes uniformly at random yields the best performance. For CP structure, on the other hand, there was no single winner, but algorithms that sampled core nodes at a higher rate consistently outperformed other sampling routines, e.g., random edge sampling and random walk sampling. The varying performance of the sampling algorithms on different tasks demonstrates the importance of carefully selecting a sub-sampling routine for the specific application.
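To make the sampling routines named in the abstract concrete, the following is a minimal sketch of three of them (uniform node sampling, random edge sampling, and random walk sampling) on a small toy graph. It is not the paper's implementation: the abstract does not give the exact variants or parameters, so the function names, the adjacency-dictionary representation, the restart rule in the random walk, and the toy core-periphery graph are all assumptions made for illustration.

```python
import random


def induced_subgraph(adj, nodes):
    """Keep only the chosen nodes and the edges among them."""
    keep = set(nodes)
    return {u: adj[u] & keep for u in keep}


def sample_nodes_uniform(adj, m, rng):
    """Draw m distinct nodes uniformly at random."""
    return set(rng.sample(sorted(adj), m))


def sample_random_edges(adj, m, rng):
    """Visit edges in random order, collecting endpoints until m nodes are held.

    Returns fewer than m nodes if too few nodes have any edge.
    """
    edges = sorted((u, v) for u in adj for v in adj[u] if u < v)
    rng.shuffle(edges)
    chosen = set()
    for u, v in edges:
        for w in (u, v):
            if len(chosen) < m:
                chosen.add(w)
        if len(chosen) == m:
            break
    return chosen


def sample_random_walk(adj, m, rng, max_steps=100_000):
    """Walk along random edges, recording visited nodes, until m are seen.

    Assumption: the walk jumps to a random node when it reaches an isolated
    node, and gives up after max_steps, so it returns up to m nodes.
    """
    nodes = sorted(adj)
    current = rng.choice(nodes)
    chosen = {current}
    for _ in range(max_steps):
        if len(chosen) == m:
            break
        neighbors = sorted(adj[current])
        current = rng.choice(neighbors) if neighbors else rng.choice(nodes)
        chosen.add(current)
    return chosen


def toy_graph():
    """Hypothetical core-periphery graph: a 4-node clique plus 8 leaves."""
    adj = {i: set() for i in range(12)}

    def add_edge(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for u in range(4):              # densely connected core
        for v in range(u + 1, 4):
            add_edge(u, v)
    for p in range(4, 12):          # each periphery node hangs off one core node
        add_edge(p, p % 4)
    return adj


if __name__ == "__main__":
    graph = toy_graph()
    rng = random.Random(0)
    for sampler in (sample_nodes_uniform, sample_random_edges, sample_random_walk):
        nodes = sampler(graph, 5, rng)
        core_share = sum(1 for u in nodes if u < 4) / len(nodes)
        print(sampler.__name__, sorted(nodes), f"core share = {core_share:.2f}")
```

In a divide-and-conquer setting, a routine like one of these would be called repeatedly, the task of interest (community detection or CP detection) would be run on each induced sub-graph, and the per-sub-graph results would then be combined. The combination step is not described in the abstract and is omitted here.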