Graph sub-sampling for divide-and-conquer algorithms in large networks

arXiv - STAT - Computation | Pub Date: 2024-09-11 | DOI: arxiv-2409.06994

Eric Yanchenko


Citations: 0

Abstract

As networks continue to increase in size, current methods must be capable of handling large numbers of nodes and edges in order to be practically relevant. Instead of working directly with the entire (large) network, analyzing sub-networks has become a popular approach. Due to a network's inherent inter-connectedness, sub-sampling is not a trivial task. While this problem has gained attention in recent years, it has not received sufficient attention from the statistics community. In this work, we provide a thorough comparison of seven graph sub-sampling algorithms by applying them to divide-and-conquer algorithms for community structure and core-periphery (CP) structure. After discussing the various algorithms and sub-sampling routines, we derive theoretical results for the mis-classification rate of the divide-and-conquer algorithm for CP structure under various sub-sampling schemes. We then perform extensive experiments on both simulated and real-world data to compare the various methods. For the community detection task, we found that sampling nodes uniformly at random yields the best performance. For CP structure, on the other hand, there was no single winner, but algorithms that sampled core nodes at a higher rate consistently outperformed other sampling routines, e.g., random edge sampling and random walk sampling. The varying performance of the sampling algorithms on different tasks demonstrates the importance of carefully selecting a sub-sampling routine for the specific application.
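To make the sampling vocabulary concrete, here is a minimal Python sketch of three of the routines the abstract names: uniform node sampling, random edge sampling, and random walk sampling. It is an illustration under stated assumptions, not the paper's implementation. The function names, the toy graph builder, the restart rule in the random walk, and the choice to return the induced sub-network are all hypothetical, and the paper's exact definitions may differ.

```python
import random


def toy_core_periphery(n_core=4, n_periphery=8):
    """Hypothetical toy graph: a fully connected core, each periphery node tied to one core node."""
    adj = {u: set() for u in range(n_core + n_periphery)}
    for u in range(n_core):
        for v in range(u + 1, n_core):
            adj[u].add(v)
            adj[v].add(u)
    for p in range(n_core, n_core + n_periphery):
        c = p % n_core
        adj[p].add(c)
        adj[c].add(p)
    return adj


def induced_subgraph(adj, nodes):
    """Keep only the sampled nodes and the edges between them."""
    keep = set(nodes)
    return {u: adj[u] & keep for u in keep}


def sample_nodes_uniform(adj, m, rng):
    """Pick m nodes uniformly at random without replacement."""
    return set(rng.sample(sorted(adj), m))


def sample_random_edges(adj, m, rng):
    """Visit edges in uniformly random order, collecting endpoints until m nodes are held."""
    edges = sorted((u, v) for u in adj for v in adj[u] if u < v)
    rng.shuffle(edges)
    nodes = set()
    for u, v in edges:
        for w in (u, v):
            if len(nodes) < m:
                nodes.add(w)
    return nodes


def sample_random_walk(adj, m, rng, max_steps=10_000):
    """Walk along edges from a random start, collecting visited nodes (at most m)."""
    all_nodes = sorted(adj)
    current = rng.choice(all_nodes)
    visited = {current}
    for _ in range(max_steps):
        if len(visited) >= m:
            break
        neighbors = sorted(adj[current])
        # Assumption: restart at a uniformly chosen node if the walk hits an isolated node.
        current = rng.choice(neighbors) if neighbors else rng.choice(all_nodes)
        visited.add(current)
    return visited


rng = random.Random(0)
graph = toy_core_periphery()
for sampler in (sample_nodes_uniform, sample_random_edges, sample_random_walk):
    sub = induced_subgraph(graph, sampler(graph, 6, rng))
    n_edges = sum(len(nbrs) for nbrs in sub.values()) // 2
    print(f"{sampler.__name__}: nodes={sorted(sub)}, edges={n_edges}")
```

In a divide-and-conquer setting, a routine like one of these would be called repeatedly and the downstream method (community detection or CP detection) would be run on each induced sub-network. How the per-sub-network results are combined is not specified in the abstract and is left out of this sketch.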
