{"title":"基于分布式内存异构系统的社区检测研究","authors":"Nitin Gawande , Sayan Ghosh , Mahantesh Halappanavar , Antonino Tumeo , Ananth Kalyanaraman","doi":"10.1016/j.parco.2022.102898","DOIUrl":null,"url":null,"abstract":"<div><p>In most real-world networks, nodes/vertices tend to be organized into tightly-knit modules known as <em>communities</em> or <em>clusters</em> such that nodes within a community are more likely to be connected or related to one another than they are to the rest of the network. Community detection in a network (graph) is aimed at finding a partitioning of the vertices into communities. The goodness of the partitioning is commonly measured using <em>modularity</em>. Maximizing modularity is an NP-complete problem. In 2008, Blondel et al. introduced a multi-phase, multi-iteration heuristic for modularity maximization called the <em>Louvain</em> method. Owing to its speed and ability to yield high quality communities, the Louvain method continues to be one of the most widely used tools for serial community detection.</p><p>Distributed multi-GPU systems pose significant challenges and opportunities for efficient execution of parallel applications. Graph algorithms, in particular, have been known to be harder to parallelize on such platforms, due to irregular memory accesses, low computation to communication ratios, and load balancing problems that are especially hard to address on multi-GPU systems.</p><p>In this paper, we present our ongoing work on distributed-memory implementation of Louvain method on heterogeneous systems. We build on our prior work parallelizing the Louvain method for community detection on traditional CPU-only distributed systems without GPUs. Corroborated by an extensive set of experiments on multi-GPU systems, we demonstrate competitive performance to existing distributed-memory CPU-based implementation, up to 3.2<span><math><mo>×</mo></math></span> speedup using 16 nodes of OLCF Summit relative to two nodes, and up to 19<span><math><mo>×</mo></math></span> speedup relative to the NVIDIA RAPIDS® <span>cuGraph</span>® implementation on a single NVIDIA V100 GPU from DGX-2 platform, while achieving high quality solutions comparable to the original Louvain method. To the best of our knowledge, this work represents the first effort for community detection on distributed multi-GPU systems. Our approach and related findings can be extended to numerous other iterative graph algorithms on multi-GPU systems.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111 ","pages":"Article 102898"},"PeriodicalIF":2.0000,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819122000060/pdfft?md5=af2c328e8814f291f58460d2c8138c36&pid=1-s2.0-S0167819122000060-main.pdf","citationCount":"2","resultStr":"{\"title\":\"Towards scaling community detection on distributed-memory heterogeneous systems\",\"authors\":\"Nitin Gawande , Sayan Ghosh , Mahantesh Halappanavar , Antonino Tumeo , Ananth Kalyanaraman\",\"doi\":\"10.1016/j.parco.2022.102898\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>In most real-world networks, nodes/vertices tend to be organized into tightly-knit modules known as <em>communities</em> or <em>clusters</em> such that nodes within a community are more likely to be connected or related to one another than they are to the rest of the network. Community detection in a network (graph) is aimed at finding a partitioning of the vertices into communities. The goodness of the partitioning is commonly measured using <em>modularity</em>. Maximizing modularity is an NP-complete problem. In 2008, Blondel et al. introduced a multi-phase, multi-iteration heuristic for modularity maximization called the <em>Louvain</em> method. Owing to its speed and ability to yield high quality communities, the Louvain method continues to be one of the most widely used tools for serial community detection.</p><p>Distributed multi-GPU systems pose significant challenges and opportunities for efficient execution of parallel applications. Graph algorithms, in particular, have been known to be harder to parallelize on such platforms, due to irregular memory accesses, low computation to communication ratios, and load balancing problems that are especially hard to address on multi-GPU systems.</p><p>In this paper, we present our ongoing work on distributed-memory implementation of Louvain method on heterogeneous systems. We build on our prior work parallelizing the Louvain method for community detection on traditional CPU-only distributed systems without GPUs. Corroborated by an extensive set of experiments on multi-GPU systems, we demonstrate competitive performance to existing distributed-memory CPU-based implementation, up to 3.2<span><math><mo>×</mo></math></span> speedup using 16 nodes of OLCF Summit relative to two nodes, and up to 19<span><math><mo>×</mo></math></span> speedup relative to the NVIDIA RAPIDS® <span>cuGraph</span>® implementation on a single NVIDIA V100 GPU from DGX-2 platform, while achieving high quality solutions comparable to the original Louvain method. To the best of our knowledge, this work represents the first effort for community detection on distributed multi-GPU systems. Our approach and related findings can be extended to numerous other iterative graph algorithms on multi-GPU systems.</p></div>\",\"PeriodicalId\":54642,\"journal\":{\"name\":\"Parallel Computing\",\"volume\":\"111 \",\"pages\":\"Article 102898\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2022-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0167819122000060/pdfft?md5=af2c328e8814f291f58460d2c8138c36&pid=1-s2.0-S0167819122000060-main.pdf\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Parallel Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167819122000060\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Parallel Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167819122000060","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Towards scaling community detection on distributed-memory heterogeneous systems
In most real-world networks, nodes/vertices tend to be organized into tightly-knit modules known as communities or clusters such that nodes within a community are more likely to be connected or related to one another than they are to the rest of the network. Community detection in a network (graph) is aimed at finding a partitioning of the vertices into communities. The goodness of the partitioning is commonly measured using modularity. Maximizing modularity is an NP-complete problem. In 2008, Blondel et al. introduced a multi-phase, multi-iteration heuristic for modularity maximization called the Louvain method. Owing to its speed and ability to yield high quality communities, the Louvain method continues to be one of the most widely used tools for serial community detection.
Distributed multi-GPU systems pose significant challenges and opportunities for efficient execution of parallel applications. Graph algorithms, in particular, have been known to be harder to parallelize on such platforms, due to irregular memory accesses, low computation to communication ratios, and load balancing problems that are especially hard to address on multi-GPU systems.
In this paper, we present our ongoing work on distributed-memory implementation of Louvain method on heterogeneous systems. We build on our prior work parallelizing the Louvain method for community detection on traditional CPU-only distributed systems without GPUs. Corroborated by an extensive set of experiments on multi-GPU systems, we demonstrate competitive performance to existing distributed-memory CPU-based implementation, up to 3.2 speedup using 16 nodes of OLCF Summit relative to two nodes, and up to 19 speedup relative to the NVIDIA RAPIDS® cuGraph® implementation on a single NVIDIA V100 GPU from DGX-2 platform, while achieving high quality solutions comparable to the original Louvain method. To the best of our knowledge, this work represents the first effort for community detection on distributed multi-GPU systems. Our approach and related findings can be extended to numerous other iterative graph algorithms on multi-GPU systems.
期刊介绍:
Parallel Computing is an international journal presenting the practical use of parallel computer systems, including high performance architecture, system software, programming systems and tools, and applications. Within this context the journal covers all aspects of high-end parallel computing from single homogeneous or heterogenous computing nodes to large-scale multi-node systems.
Parallel Computing features original research work and review articles as well as novel or illustrative accounts of application experience with (and techniques for) the use of parallel computers. We also welcome studies reproducing prior publications that either confirm or disprove prior published results.
Particular technical areas of interest include, but are not limited to:
-System software for parallel computer systems including programming languages (new languages as well as compilation techniques), operating systems (including middleware), and resource management (scheduling and load-balancing).
-Enabling software including debuggers, performance tools, and system and numeric libraries.
-General hardware (architecture) concepts, new technologies enabling the realization of such new concepts, and details of commercially available systems
-Software engineering and productivity as it relates to parallel computing
-Applications (including scientific computing, deep learning, machine learning) or tool case studies demonstrating novel ways to achieve parallelism
-Performance measurement results on state-of-the-art systems
-Approaches to effectively utilize large-scale parallel computing including new algorithms or algorithm analysis with demonstrated relevance to real applications using existing or next generation parallel computer architectures.
-Parallel I/O systems both hardware and software
-Networking technology for support of high-speed computing demonstrating the impact of high-speed computation on parallel applications