Distributed and Adaptive Partitioning for Large Graphs in Geo-Distributed Data Centers

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-04-03 DOI:10.1109/TPDS.2025.3557610

Haobin Tan;Yao Xiao;Amelie Chi Zhou;Kezhong Lu;Xuan Yang

{"title":"Distributed and Adaptive Partitioning for Large Graphs in Geo-Distributed Data Centers","authors":"Haobin Tan;Yao Xiao;Amelie Chi Zhou;Kezhong Lu;Xuan Yang","doi":"10.1109/TPDS.2025.3557610","DOIUrl":null,"url":null,"abstract":"Graph partitioning is of great importance to optimizing the performance and cost of geo-distributed graph analytics applications. However, it is non-trivial to obtain efficient and effective partitioning due to the challenges brought by the <italic>large graph scales, <italic>dynamic graph changes and the <italic>network heterogeneity in geo-distributed data centers (DCs). Existing studies usually adopt heuristic-based methods to achieve fast and balanced partitioning for large graphs, which are not powerful enough to address the complexity in our problem. Further, graph structures of many applications can change at various frequencies. Dynamic partitioning methods usually focus on achieving low latency to quickly adapt to changes, which unfortunately sacrifices partitioning effectiveness. Also, such methods are not aware of the dynamicity of graphs and can over sacrifice effectiveness for unnecessarily low latency. To address the limitations of existing studies, we propose <italic>DistRLCut, a novel graph partitioner which leverages Multi-Agent Reinforcement Learning (MARL) to solve the complexity of the partitioning problem. To achieve fast partitioning for large graphs, <italic>DistRLCut adapts MARL to a distributed implementation which significantly accelerates the learning process. Further, <italic>DistRLCut incorporates two techniques to trade-off between partitioning effectiveness and efficiency, including local training and agent sampling. By adaptively tuning the number of local training iterations and the agent sampling rate, <italic>DistRLCut is able to achieve good partitioning results within an overhead constraint required by graph dynamicity. Experiments using real cloud DCs and real-world graphs show that, compared to state-of-the-art static partitioning methods, <italic>DistRLCut improves the performance of geo-distributed graph analytics by 11%-95%. <italic>DistRLCut can partition over 28 million edges per second, showcasing its scalability for large graphs. With varying graph changing frequencies, <italic>DistRLCut can improve the performance by up to 71% compared to state-of-the-art dynamic partitioning.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1161-1174"},"PeriodicalIF":5.6000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Parallel and Distributed Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10948369/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

Graph partitioning is of great importance to optimizing the performance and cost of geo-distributed graph analytics applications. However, it is non-trivial to obtain efficient and effective partitioning due to the challenges brought by the large graph scales, dynamic graph changes and the network heterogeneity in geo-distributed data centers (DCs). Existing studies usually adopt heuristic-based methods to achieve fast and balanced partitioning for large graphs, which are not powerful enough to address the complexity in our problem. Further, graph structures of many applications can change at various frequencies. Dynamic partitioning methods usually focus on achieving low latency to quickly adapt to changes, which unfortunately sacrifices partitioning effectiveness. Also, such methods are not aware of the dynamicity of graphs and can over sacrifice effectiveness for unnecessarily low latency. To address the limitations of existing studies, we propose DistRLCut, a novel graph partitioner which leverages Multi-Agent Reinforcement Learning (MARL) to solve the complexity of the partitioning problem. To achieve fast partitioning for large graphs, DistRLCut adapts MARL to a distributed implementation which significantly accelerates the learning process. Further, DistRLCut incorporates two techniques to trade-off between partitioning effectiveness and efficiency, including local training and agent sampling. By adaptively tuning the number of local training iterations and the agent sampling rate, DistRLCut is able to achieve good partitioning results within an overhead constraint required by graph dynamicity. Experiments using real cloud DCs and real-world graphs show that, compared to state-of-the-art static partitioning methods, DistRLCut improves the performance of geo-distributed graph analytics by 11%-95%. DistRLCut can partition over 28 million edges per second, showcasing its scalability for large graphs. With varying graph changing frequencies, DistRLCut can improve the performance by up to 71% compared to state-of-the-art dynamic partitioning.

查看原文本刊更多论文

地理分布数据中心中大图的分布式和自适应分区

图划分对于优化地理分布式图分析应用程序的性能和成本具有重要意义。然而，由于地理分布式数据中心的大图规模、图的动态变化和网络的异构性给分区带来的挑战，实现高效、有效的分区并不是一件容易的事情。现有的研究通常采用基于启发式的方法来实现大型图的快速和平衡分区，这不足以解决我们问题的复杂性。此外，许多应用程序的图形结构可以在不同的频率下变化。动态分区方法通常侧重于实现低延迟以快速适应更改，不幸的是这会牺牲分区的有效性。此外，这些方法不知道图形的动态性，并且可能会为了不必要的低延迟而过度牺牲效率。为了解决现有研究的局限性，我们提出了一种新的图分区器DistRLCut，它利用多智能体强化学习（MARL）来解决分区问题的复杂性。为了实现大型图的快速分区，DistRLCut将MARL应用于分布式实现，这大大加快了学习过程。此外，DistRLCut结合了两种技术来权衡分区的有效性和效率，包括局部训练和代理抽样。通过自适应调整局部训练迭代次数和代理采样率，DistRLCut能够在图动态性所需的开销约束下获得良好的分区结果。使用真实云数据中心和真实图形的实验表明，与最先进的静态分区方法相比，DistRLCut将地理分布式图形分析的性能提高了11%-95%。DistRLCut每秒可以划分超过2800万条边，展示了它对大型图的可扩展性。与最先进的动态分区相比，使用不同的图形更改频率，DistRLCut可以将性能提高71%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Parallel and Distributed Systems 工程技术-工程：电子与电气

CiteScore

11.00

自引率

9.40%

发文量

281

审稿时长

5.6 months

期刊介绍： IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to: a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms, scalability of algorithms and data structures for parallel and distributed systems, communication and synchronization protocols, network algorithms, scheduling, and load balancing. b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems. c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation. d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.