arXiv - CS - Distributed, Parallel, and Cluster Computing: Latest Publications

Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2024-09-02 DOI: arxiv-2409.01075
Yangjie Zhou, Honglin Zhu, Qian Qiu, Weihao Cui, Zihan Liu, Cong Guo, Siyuan Feng, Jintao Meng, Haidong Lan, Jingwen Leng, Wenxi Zhu, Minwen Deng
{"title":"Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization","authors":"Yangjie Zhou, Honglin Zhu, Qian Qiu, Weihao Cui, Zihan Liu, Cong Guo, Siyuan Feng, Jintao Meng, Haidong Lan, Jingwen Leng, Wenxi Zhu, Minwen Deng","doi":"arxiv-2409.01075","DOIUrl":"https://doi.org/arxiv-2409.01075","url":null,"abstract":"Dynamic-shape deep neural networks (DNNs) are rapidly evolving, attracting\u0000attention for their ability to handle variable input sizes in real-time\u0000applications. However, existing compilation optimization methods for such\u0000networks often rely heavily on predefined samples to guide the compilation\u0000process, which restricts their adaptability and efficiency. These sample-driven\u0000methods struggle to efficiently manage the diverse and unpredictable shapes\u0000encountered in real-world scenarios, often resulting in suboptimal performance. To tackle these issues, we introduce Vortex, a hardware-driven and\u0000sample-free compiler tailored for dynamic-shape tensor programs. Vortex\u0000capitalizes on detailed hardware information and hierarchizes the strategy\u0000space to facilitate high-performance code generation without relying on runtime\u0000shape samples. It features a unique bidirectional compilation workflow,\u0000combining top-down abstraction for aligning tensor program execution with\u0000hardware hierarchies and bottom-up kernel construction to narrow the search\u0000space, enabling Vortex to achieve remarkable efficiency. Comprehensive\u0000evaluations confirm that Vortex reduces compilation time by $176times$\u0000compared to the existing dynamic-shape compiler. Additionally, it substantially\u0000outperforms existing vendor-provided libraries and dynamic-shape compilers on\u0000both CPU and GPU platforms, delivering speedups of $2.53times$ and\u0000$3.01times$, respectively.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
How local constraints influence network diameter and applications to LCL generalizations
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2024-09-02 DOI: arxiv-2409.01305
Nicolas Bousquet, Laurent Feuilloley, Théo Pierron
{"title":"How local constraints influence network diameter and applications to LCL generalizations","authors":"Nicolas Bousquet, Laurent Feuilloley, Théo Pierron","doi":"arxiv-2409.01305","DOIUrl":"https://doi.org/arxiv-2409.01305","url":null,"abstract":"In this paper, we investigate how local rules enforced at every node can\u0000influence the topology of a network. More precisely, we establish several\u0000results on the diameter of trees as a function of the number of nodes, as\u0000listed below. These results have important consequences on the landscape of\u0000locally checkable labelings (LCL) on emph{unbounded} degree graphs, a case in\u0000which our lack of knowledge is in striking contrast with that of emph{bounded\u0000degree graphs}, that has been intensively studied recently. [See paper for full\u0000abstract.]","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2024-09-02 DOI: arxiv-2409.00918
Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang
{"title":"LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs","authors":"Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang","doi":"arxiv-2409.00918","DOIUrl":"https://doi.org/arxiv-2409.00918","url":null,"abstract":"The recent progress made in large language models (LLMs) has brought\u0000tremendous application prospects to the world. The growing model size demands\u0000LLM training on multiple GPUs, while data parallelism is the most popular\u0000distributed training strategy due to its simplicity, efficiency, and\u0000scalability. Current systems adopt the model-sharded data parallelism to enable\u0000memory-efficient training, however, existing model-sharded data-parallel\u0000systems fail to efficiently utilize GPU on a commodity GPU cluster with 100\u0000Gbps (or 200 Gbps) inter-GPU bandwidth due to 1) severe interference between\u0000collective operation and GPU computation and 2) heavy CPU optimizer overhead.\u0000Recent works propose in-network aggregation (INA) to relieve the network\u0000bandwidth pressure in data-parallel training, but they are incompatible with\u0000model sharding due to the network design. To this end, we propose LuWu, a novel\u0000in-network optimizer that enables efficient model-in-network data-parallel\u0000training of a 100B-scale model on distributed GPUs. Such new data-parallel\u0000paradigm keeps a similar communication pattern as model-sharded data\u0000parallelism but with a centralized in-network optimizer execution. The key idea\u0000is to offload the entire optimizer states and parameters from GPU workers onto\u0000an in-network optimizer node and to offload the entire collective communication\u0000from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization. The\u0000experimental results show that LuWu outperforms the state-of-the-art training\u0000system by 3.98x when training on a 175B model on an 8-worker cluster.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2024-09-02 DOI: arxiv-2409.01143
Ran Yan, Youhe Jiang, Wangcheng Tao, Xiaonan Nie, Bin Cui, Binhang Yuan
{"title":"FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment","authors":"Ran Yan, Youhe Jiang, Wangcheng Tao, Xiaonan Nie, Bin Cui, Binhang Yuan","doi":"arxiv-2409.01143","DOIUrl":"https://doi.org/arxiv-2409.01143","url":null,"abstract":"Training large language model (LLM) is a computationally intensive task,\u0000which is typically conducted in data centers with homogeneous high-performance\u0000GPUs. This paper explores an alternative approach by deploying the training\u0000computation across heterogeneous GPUs to enable better flexibility and\u0000efficiency for heterogeneous resource utilization. To achieve this goal, we\u0000propose a novel system, FlashFlex, that can flexibly support an asymmetric\u0000partition of the parallel training computations across the scope of data-,\u0000pipeline-, and tensor model parallelism. We further formalize the allocation of\u0000asymmetric partitioned training computations over a set of heterogeneous GPUs\u0000as a constrained optimization problem and propose an efficient solution based\u0000on a hierarchical graph partitioning algorithm. Our approach can adaptively\u0000allocate asymmetric training computations across GPUs, fully leveraging the\u0000available computational power. We conduct extensive empirical studies to\u0000evaluate the performance of FlashFlex, where we find that when training LLMs at\u0000different scales (from 7B to 30B), FlashFlex can achieve comparable training\u0000MFU when running over a set of heterogeneous GPUs compared with the state of\u0000the art training systems running over a set of homogeneous high-performance\u0000GPUs with the same amount of total peak FLOPS. The achieved smallest gaps in\u0000MFU are 11.61% and 0.30%, depending on whether the homogeneous setting is\u0000equipped with and without RDMA. Our implementation is available at\u0000https://github.com/Relaxed-System-Lab/FlashFlex.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Federated Aggregation of Mallows Rankings: A Comparative Analysis of Borda and Lehmer Coding
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2024-09-01 DOI: arxiv-2409.00848
Jin Sima, Vishal Rana, Olgica Milenkovic
{"title":"Federated Aggregation of Mallows Rankings: A Comparative Analysis of Borda and Lehmer Coding","authors":"Jin Sima, Vishal Rana, Olgica Milenkovic","doi":"arxiv-2409.00848","DOIUrl":"https://doi.org/arxiv-2409.00848","url":null,"abstract":"Rank aggregation combines multiple ranked lists into a consensus ranking. In\u0000fields like biomedical data sharing, rankings may be distributed and require\u0000privacy. This motivates the need for federated rank aggregation protocols,\u0000which support distributed, private, and communication-efficient learning across\u0000multiple clients with local data. We present the first known federated rank\u0000aggregation methods using Borda scoring and Lehmer codes, focusing on the\u0000sample complexity for federated algorithms on Mallows distributions with a\u0000known scaling factor $phi$ and an unknown centroid permutation $sigma_0$.\u0000Federated Borda approach involves local client scoring, nontrivial\u0000quantization, and privacy-preserving protocols. We show that for $phi in\u0000[0,1)$, and arbitrary $sigma_0$ of length $N$, it suffices for each of the $L$\u0000clients to locally aggregate $max{C_1(phi), C_2(phi)frac{1}{L}log\u0000frac{N}{delta}}$ rankings, where $C_1(phi)$ and $C_2(phi)$ are constants,\u0000quantize the result, and send it to the server who can then recover $sigma_0$\u0000with probability $geq 1-delta$. Communication complexity scales as $NL log\u0000N$. Our results represent the first rigorous analysis of Borda's method in\u0000centralized and distributed settings under the Mallows model. Federated Lehmer\u0000coding approach creates a local Lehmer code for each client, using a\u0000coordinate-majority aggregation approach with specialized quantization methods\u0000for efficiency and privacy. We show that for $phi+phi^2<1+phi^N$, and\u0000arbitrary $sigma_0$ of length $N$, it suffices for each of the $L$ clients to\u0000locally aggregate $max{C_3(phi), C_4(phi)frac{1}{L}log\u0000frac{N}{delta}}$ rankings, where $C_3(phi)$ and $C_4(phi)$ are constants.\u0000Clients send truncated Lehmer coordinate histograms to the server, which can\u0000recover $sigma_0$ with probability $geq 1-delta$. Communication complexity\u0000is $sim O(Nlog NLlog L)$.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
RTop-K: Ultra-Fast Row-Wise Top-K Algorithm and GPU Implementation for Neural Networks
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2024-09-01 DOI: arxiv-2409.00822
Xi Xie, Yuebo Luo, Hongwu Peng, Caiwen Ding
{"title":"RTop-K: Ultra-Fast Row-Wise Top-K Algorithm and GPU Implementation for Neural Networks","authors":"Xi Xie, Yuebo Luo, Hongwu Peng, Caiwen Ding","doi":"arxiv-2409.00822","DOIUrl":"https://doi.org/arxiv-2409.00822","url":null,"abstract":"Top-k algorithms are essential in various applications, from high-performance\u0000computing and information retrieval to big data and neural network model\u0000training. This paper introduces RTop-K, a highly efficient parallel row-wise\u0000top-k selection algorithm designed for GPUs. RTop-K employs a Binary\u0000Search-based approach to optimize resource allocation and provides a scalable\u0000solution that significantly accelerates top-k operations. We perform a\u0000theoretical analysis of the effects of early stopping in our algorithm,\u0000demonstrating that it maintains the accuracy of neural network models while\u0000enhancing performance. Comprehensive tests show that our GPU implementation of\u0000RTop-K outperforms other row-wise top-k GPU implementations, with minimal\u0000impact on testing accuracy when early stopping is applied. Notably, RTop-K\u0000achieves speed increases ranging from 4.245$times$ to 9.506$times$ with early\u0000stopping, and 3.936$times$ without early stopping, compared to\u0000state-of-the-art implementations. The proposed methods offer significant\u0000improvements in the training and inference of Graph Neural Networks (GNNs),\u0000addressing critical challenges in latency and throughput on GPU platforms.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"268 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Container Data Item: An Abstract Datatype for Efficient Container-based Edge Computing
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2024-09-01 DOI: arxiv-2409.00801
Md Rezwanur Rahman, Tarun Annapareddy, Shirin Ebadi, Varsha Natarajan, Adarsh Srinivasan, Eric Keller, Shivakant Mishra
{"title":"Container Data Item: An Abstract Datatype for Efficient Container-based Edge Computing","authors":"Md Rezwanur Rahman, Tarun Annapareddy, Shirin Ebadi, Varsha Natarajan, Adarsh Srinivasan, Eric Keller, Shivakant Mishra","doi":"arxiv-2409.00801","DOIUrl":"https://doi.org/arxiv-2409.00801","url":null,"abstract":"We present Container Data Item (CDI), an abstract datatype that allows\u0000multiple containers to efficiently operate on a common data item while\u0000preserving their strong security and isolation semantics. Application\u0000developers can use CDIs to enable multiple containers to operate on the same\u0000data, synchronize execution among themselves, and control the ownership of the\u0000shared data item during runtime. These containers may reside on the same server\u0000or different servers. CDI is designed to support microservice based\u0000applications comprised of a set of interconnected microservices, each\u0000implemented by a separate dedicated container. CDI preserves the important\u0000isolation semantics of containers by ensuring that exactly one container owns a\u0000CDI object at any instant and the ownership of a CDI object may be transferred\u0000from one container to another only by the current CDI object owner. We present\u0000three different implementations of CDI that allow different containers residing\u0000on the same server as well containers residing on different servers to use CDI\u0000for efficiently operating on a common data item. The paper provides an\u0000extensive performance evaluation of CDI along with two representative\u0000applications, an augmented reality application and a decentralized workflow\u0000orchestrator.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HopGNN: Boosting Distributed GNN Training Efficiency via Feature-Centric Model Migration
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2024-09-01 DOI: arxiv-2409.00657
Weijian Chen, Shuibing He, Haoyang Qu, Xuechen Zhang, Dan Feng
{"title":"HopGNN: Boosting Distributed GNN Training Efficiency via Feature-Centric Model Migration","authors":"Weijian Chen, Shuibing He, Haoyang Qu, Xuechen Zhang, Dan Feng","doi":"arxiv-2409.00657","DOIUrl":"https://doi.org/arxiv-2409.00657","url":null,"abstract":"Distributed training of graph neural networks (GNNs) has become a crucial\u0000technique for processing large graphs. Prevalent GNN frameworks are\u0000model-centric, necessitating the transfer of massive graph vertex features to\u0000GNN models, which leads to a significant communication bottleneck. Recognizing\u0000that the model size is often significantly smaller than the feature size, we\u0000propose LeapGNN, a feature-centric framework that reverses this paradigm by\u0000bringing GNN models to vertex features. To make it truly effective, we first\u0000propose a micrograph-based training strategy that trains the model using a\u0000refined structure with superior locality to reduce remote feature retrieval.\u0000Then, we devise a feature pre-gathering approach that merges multiple fetch\u0000operations into a single one to eliminate redundant feature transmissions.\u0000Finally, we employ a micrograph-based merging method that adjusts the number of\u0000micrographs for each worker to minimize kernel switches and synchronization\u0000overhead. Our experimental results demonstrate that LeapGNN achieves a\u0000performance speedup of up to 4.2x compared to the state-of-the-art method,\u0000namely P3.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"80 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Universal Finite-State and Self-Stabilizing Computation in Anonymous Dynamic Networks
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2024-09-01 DOI: arxiv-2409.00688
Giuseppe A. Di Luna, Giovanni Viglietta
{"title":"Universal Finite-State and Self-Stabilizing Computation in Anonymous Dynamic Networks","authors":"Giuseppe A. Di Luna, Giovanni Viglietta","doi":"arxiv-2409.00688","DOIUrl":"https://doi.org/arxiv-2409.00688","url":null,"abstract":"A network is said to be \"anonymous\" if its agents are indistinguishable from\u0000each other; it is \"dynamic\" if its communication links may appear or disappear\u0000unpredictably over time. Assuming that an anonymous dynamic network is always\u0000connected and each of its $n$ agents is initially given an input, it takes $2n$\u0000communication rounds for the agents to compute an arbitrary (frequency-based)\u0000function of such inputs (Di Luna-Viglietta, DISC 2023). It is known that, without making additional assumptions on the network and\u0000without knowing the number of agents $n$, it is impossible to compute most\u0000functions and explicitly terminate. In fact, current state-of-the-art\u0000algorithms only achieve stabilization, i.e., allow each agent to return an\u0000output after every communication round; outputs can be changed, and are\u0000guaranteed to be all correct after $2n$ rounds. Such algorithms rely on the\u0000incremental construction of a data structure called \"history tree\", which is\u0000augmented at every round. Thus, they end up consuming an unlimited amount of\u0000memory, and are also prone to errors in case of memory loss or corruption. In this paper, we provide a general self-stabilizing algorithm for anonymous\u0000dynamic networks that stabilizes in $max{4n-2h, 2h}$ rounds (where $h$\u0000measures the amount of corrupted data initially present in the memory of each\u0000agent), as well as a general finite-state algorithm that stabilizes in $3n^2$\u0000rounds. Our work improves upon previously known methods that only apply to\u0000static networks (Boldi-Vigna, Dist. Comp. 2002). In addition, we develop new\u0000fundamental techniques and operations involving history trees, which are of\u0000independent interest.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"213 Suppl 2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Demo: FedCampus: A Real-world Privacy-preserving Mobile Application for Smart Campus via Federated Learning & Analytics
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2024-08-31 DOI: arxiv-2409.00327
Jiaxiang Geng, Beilong Tang, Boyan Zhang, Jiaqi Shao, Bing Luo
{"title":"Demo: FedCampus: A Real-world Privacy-preserving Mobile Application for Smart Campus via Federated Learning & Analytics","authors":"Jiaxiang Geng, Beilong Tang, Boyan Zhang, Jiaqi Shao, Bing Luo","doi":"arxiv-2409.00327","DOIUrl":"https://doi.org/arxiv-2409.00327","url":null,"abstract":"In this demo, we introduce FedCampus, a privacy-preserving mobile application\u0000for smart underline{campus} with underline{fed}erated learning (FL) and\u0000federated analytics (FA). FedCampus enables cross-platform on-device FL/FA for\u0000both iOS and Android, supporting continuously models and algorithms deployment\u0000(MLOps). Our app integrates privacy-preserving processed data via differential\u0000privacy (DP) from smartwatches, where the processed parameters are used for\u0000FL/FA through the FedCampus backend platform. We distributed 100 smartwatches\u0000to volunteers at Duke Kunshan University and have successfully completed a\u0000series of smart campus tasks featuring capabilities such as sleep tracking,\u0000physical activity monitoring, personalized recommendations, and heavy hitters.\u0000Our project is opensourced at https://github.com/FedCampus/FedCampus_Flutter.\u0000See the FedCampus video at https://youtu.be/k5iu46IjA38.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0