Author: Eric Yanchenko
Venue: arXiv - STAT - Computation
Published: 2024-09-11
Identifier: arxiv-2409.06994 (https://doi.org/arxiv-2409.06994)
Graph sub-sampling for divide-and-conquer algorithms in large networks
As networks continue to increase in size, current methods must be capable of
handling large numbers of nodes and edges in order to be practically relevant.
Instead of working directly with the entire (large) network, analyzing
sub-networks has become a popular approach. Due to a network's inherent
inter-connectedness, sub-sampling is not a trivial task. Although this problem has
gained attention in recent years, it remains understudied in
the statistics community. In this work, we provide a thorough comparison of
seven graph sub-sampling algorithms by applying them to divide-and-conquer
algorithms for community structure and core-periphery (CP) structure. After
discussing the various algorithms and sub-sampling routines, we derive
theoretical results for the mis-classification rate of the divide-and-conquer
algorithm for CP structure under various sub-sampling schemes. We then perform
extensive experiments on both simulated and real-world data to compare the
various methods. For the community detection task, we found that sampling nodes
uniformly at random yields the best performance. For CP structure, on the other
hand, there was no single winner, but algorithms that sampled core nodes at a
higher rate consistently outperformed other sampling routines, e.g., random
edge sampling and random walk sampling. The varying performance of the sampling
algorithms on different tasks demonstrates the importance of carefully
selecting a sub-sampling routine for the specific application.
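
To make three of the sampling schemes named above concrete (uniform node sampling, random edge sampling, and random walk sampling), here is a minimal sketch in Python. It is not the paper's implementation: the function names, the toy core-periphery graph, and the stopping rules are all assumptions chosen for illustration, and the abstract does not say how the seven algorithms are defined. Each function returns a set of sampled nodes, and `induced_subgraph` keeps the edges among them, which is the sub-network a divide-and-conquer step would then analyze.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets built from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def sample_nodes_uniform(adj, k, rng):
    """Draw k distinct nodes uniformly at random."""
    return set(rng.sample(sorted(adj), k))


def sample_nodes_by_edges(edges, k, rng):
    """Visit edges in random order and keep their endpoints until k nodes
    are collected (assumed stopping rule). High-degree nodes appear in more
    edges, so they tend to be picked earlier."""
    shuffled = list(edges)
    rng.shuffle(shuffled)
    nodes = set()
    for u, v in shuffled:
        for w in (u, v):
            if len(nodes) < k:
                nodes.add(w)
        if len(nodes) >= k:
            break
    return nodes


def sample_nodes_random_walk(adj, k, rng, max_steps=100_000):
    """Simple random walk from a random start node, recording visited nodes
    until k distinct nodes are seen or max_steps is reached (assumed cap)."""
    current = rng.choice(sorted(adj))
    nodes = {current}
    for _ in range(max_steps):
        if len(nodes) >= k:
            break
        current = rng.choice(sorted(adj[current]))
        nodes.add(current)
    return nodes


def induced_subgraph(edges, nodes):
    """Edges whose endpoints both lie in the sampled node set."""
    return [(u, v) for u, v in edges if u in nodes and v in nodes]


# Hypothetical toy graph: core nodes 0-3 are fully connected,
# periphery nodes 4-11 each attach to a single core node.
CORE = range(4)
TOY_EDGES = [(u, v) for u in CORE for v in CORE if u < v]
TOY_EDGES += [(p, p % 4) for p in range(4, 12)]
TOY_ADJ = build_adjacency(TOY_EDGES)

rng = random.Random(0)
samples = {
    "uniform": sample_nodes_uniform(TOY_ADJ, 6, rng),
    "edge": sample_nodes_by_edges(TOY_EDGES, 6, rng),
    "walk": sample_nodes_random_walk(TOY_ADJ, 6, rng),
}
for name, nodes in samples.items():
    print(name, sorted(nodes), len(induced_subgraph(TOY_EDGES, nodes)))
```

In this sketch, edge sampling and random walk sampling both favor well-connected nodes, whereas uniform sampling treats every node alike. Which behavior helps depends on the downstream task, as the abstract notes.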