Xinchi Han, Shizhen Zhao, Yongxi Lv, Peirui Cao, Weihao Jiang, Qinwei Yang, Yunzhuo Liu, Shengkai Lin, Bo Jiang, Ximeng Liu, Yong Cui, Chenghu Zhou, Xinbing Wang
{"title":"vClos:多租户GPU集群中分布式机器学习任务的网络竞争感知调度","authors":"Xinchi Han , Shizhen Zhao , Yongxi Lv , Peirui Cao , Weihao Jiang , Qinwei Yang , Yunzhuo Liu , Shengkai Lin , Bo Jiang , Ximeng Liu , Yong Cui , Chenghu Zhou , Xinbing Wang","doi":"10.1016/j.comnet.2025.111285","DOIUrl":null,"url":null,"abstract":"<div><div>Distributed machine learning (DML) technology enables training large neural networks in reasonable time. However, with computing power growing faster than network capacity, network communication becomes the DML training bottleneck. Current multi-tenant GPU clusters suffer network contention due to hash-collision, which not only increases the overhead of communication but also will increase the waiting time of users. This paper analyzes how network contention slows training throughput and summarizes training traffic patterns as a <span><math><mi>P</mi></math></span>-incast-free Pattern. We propose a <em>Balanced Routing</em> which leverages training traffic patterns to reduce contention. Furthermore, we introduce <em>vClos</em> to handle contention through jointly considering topology, routing, communication pattern, and GPU assignment. Evaluations via testbed experiments and real-trace-based simulations show <em>vClos</em> reduces average job completion time up to 67.6% compared to ECMP in heavy workloads.</div></div>","PeriodicalId":50637,"journal":{"name":"Computer Networks","volume":"268 ","pages":"Article 111285"},"PeriodicalIF":4.6000,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"vClos: Network contention aware scheduling for distributed machine learning tasks in multi-tenant GPU clusters\",\"authors\":\"Xinchi Han , Shizhen Zhao , Yongxi Lv , Peirui Cao , Weihao Jiang , Qinwei Yang , Yunzhuo Liu , Shengkai Lin , Bo Jiang , Ximeng Liu , Yong Cui , Chenghu Zhou , Xinbing Wang\",\"doi\":\"10.1016/j.comnet.2025.111285\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Distributed machine learning (DML) technology enables training large neural networks in reasonable time. However, with computing power growing faster than network capacity, network communication becomes the DML training bottleneck. Current multi-tenant GPU clusters suffer network contention due to hash-collision, which not only increases the overhead of communication but also will increase the waiting time of users. This paper analyzes how network contention slows training throughput and summarizes training traffic patterns as a <span><math><mi>P</mi></math></span>-incast-free Pattern. We propose a <em>Balanced Routing</em> which leverages training traffic patterns to reduce contention. Furthermore, we introduce <em>vClos</em> to handle contention through jointly considering topology, routing, communication pattern, and GPU assignment. 
Evaluations via testbed experiments and real-trace-based simulations show <em>vClos</em> reduces average job completion time up to 67.6% compared to ECMP in heavy workloads.</div></div>\",\"PeriodicalId\":50637,\"journal\":{\"name\":\"Computer Networks\",\"volume\":\"268 \",\"pages\":\"Article 111285\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2025-05-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1389128625002531\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1389128625002531","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
vClos: Network contention aware scheduling for distributed machine learning tasks in multi-tenant GPU clusters
Distributed machine learning (DML) technology enables training large neural networks in reasonable time. However, with computing power growing faster than network capacity, network communication has become the bottleneck of DML training. Current multi-tenant GPU clusters suffer from network contention caused by hash collisions, which not only increases communication overhead but also lengthens users' waiting time. This paper analyzes how network contention slows training throughput and summarizes DML training traffic patterns as a P-incast-free pattern. We propose Balanced Routing, which leverages these traffic patterns to reduce contention. Furthermore, we introduce vClos, which handles contention by jointly considering topology, routing, communication pattern, and GPU assignment. Evaluations via testbed experiments and real-trace-based simulations show that vClos reduces average job completion time by up to 67.6% compared with ECMP under heavy workloads.
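To make the contention problem concrete, the sketch below contrasts hash-based (ECMP-like) uplink selection with a simple load-balanced assignment. This is an illustrative toy example, not the paper's vClos or Balanced Routing implementation; the flow names, the number of uplinks, and the round-robin policy are assumptions chosen only to show how hash collisions can overload one uplink while others stay idle.

```python
import random

# Illustrative sketch (assumed setup, not the paper's method): under ECMP,
# each flow's uplink is chosen by hashing flow identifiers, so several large
# DML flows may collide on the same uplink. A contention-aware assignment
# spreads the same flows evenly across uplinks.

NUM_UPLINKS = 4
flows = [f"flow-{i}" for i in range(8)]  # hypothetical all-reduce flows


def ecmp_assign(flows, num_uplinks, seed=0):
    """Hash-based (ECMP-like) uplink selection; collisions are possible."""
    rng = random.Random(seed)
    salt = rng.randint(0, 1 << 30)
    return {f: hash((f, salt)) % num_uplinks for f in flows}


def balanced_assign(flows, num_uplinks):
    """Round-robin assignment: no uplink carries more than ceil(F/U) flows."""
    return {f: i % num_uplinks for i, f in enumerate(flows)}


def max_load(assignment, num_uplinks):
    """Largest number of flows mapped to any single uplink."""
    loads = [0] * num_uplinks
    for uplink in assignment.values():
        loads[uplink] += 1
    return max(loads)


if __name__ == "__main__":
    ecmp = ecmp_assign(flows, NUM_UPLINKS)
    balanced = balanced_assign(flows, NUM_UPLINKS)
    # With 8 flows over 4 uplinks, the balanced scheme keeps every uplink at
    # 2 flows, while hashing may place 3 or more flows on one uplink,
    # i.e., the contention the abstract attributes to hash collisions.
    print("ECMP max flows on one uplink:    ", max_load(ecmp, NUM_UPLINKS))
    print("Balanced max flows on one uplink:", max_load(balanced, NUM_UPLINKS))
```

Running the sketch typically shows the hash-based mapping concentrating flows on a subset of uplinks while the balanced mapping caps every uplink at two flows, which is the intuition behind exploiting known training traffic patterns rather than relying on hashing.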
Journal introduction:
Computer Networks is an international, archival journal providing complete coverage of all topics of interest to those involved in computer communications networking. The audience includes researchers, managers, and operators of networks, as well as designers and implementers. The Editorial Board will consider any material for publication that is of interest to these groups.