Xinchi Han, Shizhen Zhao, Yongxi Lv, Peirui Cao, Weihao Jiang, Qinwei Yang, Yunzhuo Liu, Shengkai Lin, Bo Jiang, Ximeng Liu, Yong Cui, Chenghu Zhou, Xinbing Wang
{"title":"vClos:多租户GPU集群中分布式机器学习任务的网络竞争感知调度","authors":"Xinchi Han , Shizhen Zhao , Yongxi Lv , Peirui Cao , Weihao Jiang , Qinwei Yang , Yunzhuo Liu , Shengkai Lin , Bo Jiang , Ximeng Liu , Yong Cui , Chenghu Zhou , Xinbing Wang","doi":"10.1016/j.comnet.2025.111285","DOIUrl":null,"url":null,"abstract":"<div><div>Distributed machine learning (DML) technology enables training large neural networks in reasonable time. However, with computing power growing faster than network capacity, network communication becomes the DML training bottleneck. Current multi-tenant GPU clusters suffer network contention due to hash-collision, which not only increases the overhead of communication but also will increase the waiting time of users. This paper analyzes how network contention slows training throughput and summarizes training traffic patterns as a <span><math><mi>P</mi></math></span>-incast-free Pattern. We propose a <em>Balanced Routing</em> which leverages training traffic patterns to reduce contention. Furthermore, we introduce <em>vClos</em> to handle contention through jointly considering topology, routing, communication pattern, and GPU assignment. Evaluations via testbed experiments and real-trace-based simulations show <em>vClos</em> reduces average job completion time up to 67.6% compared to ECMP in heavy workloads.</div></div>","PeriodicalId":50637,"journal":{"name":"Computer Networks","volume":"268 ","pages":"Article 111285"},"PeriodicalIF":4.6000,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"vClos: Network contention aware scheduling for distributed machine learning tasks in multi-tenant GPU clusters\",\"authors\":\"Xinchi Han , Shizhen Zhao , Yongxi Lv , Peirui Cao , Weihao Jiang , Qinwei Yang , Yunzhuo Liu , Shengkai Lin , Bo Jiang , Ximeng Liu , Yong Cui , Chenghu Zhou , Xinbing Wang\",\"doi\":\"10.1016/j.comnet.2025.111285\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Distributed machine learning (DML) technology enables training large neural networks in reasonable time. However, with computing power growing faster than network capacity, network communication becomes the DML training bottleneck. Current multi-tenant GPU clusters suffer network contention due to hash-collision, which not only increases the overhead of communication but also will increase the waiting time of users. This paper analyzes how network contention slows training throughput and summarizes training traffic patterns as a <span><math><mi>P</mi></math></span>-incast-free Pattern. We propose a <em>Balanced Routing</em> which leverages training traffic patterns to reduce contention. Furthermore, we introduce <em>vClos</em> to handle contention through jointly considering topology, routing, communication pattern, and GPU assignment. 
Evaluations via testbed experiments and real-trace-based simulations show <em>vClos</em> reduces average job completion time up to 67.6% compared to ECMP in heavy workloads.</div></div>\",\"PeriodicalId\":50637,\"journal\":{\"name\":\"Computer Networks\",\"volume\":\"268 \",\"pages\":\"Article 111285\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2025-05-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1389128625002531\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1389128625002531","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
vClos: Network contention aware scheduling for distributed machine learning tasks in multi-tenant GPU clusters
Distributed machine learning (DML) technology enables training large neural networks in reasonable time. However, with computing power growing faster than network capacity, network communication has become the bottleneck of DML training. Current multi-tenant GPU clusters suffer from network contention caused by hash collisions, which not only increases communication overhead but also lengthens users' waiting time. This paper analyzes how network contention slows training throughput and summarizes DML training traffic patterns as a P-incast-free pattern. We propose Balanced Routing, which leverages these traffic patterns to reduce contention. Furthermore, we introduce vClos, which handles contention by jointly considering topology, routing, communication pattern, and GPU assignment. Evaluations via testbed experiments and real-trace-based simulations show that vClos reduces average job completion time by up to 67.6% compared with ECMP under heavy workloads.
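To make the contention problem concrete, the sketch below contrasts hash-based (ECMP-like) uplink selection with a simple load-balanced assignment. This is an illustrative toy example, not the paper's vClos or Balanced Routing implementation; the flow names, the number of uplinks, and the round-robin policy are assumptions chosen only to show how hash collisions can overload one uplink while others stay idle.

```python
import random

# Illustrative sketch (assumed setup, not the paper's method): under ECMP,
# each flow's uplink is chosen by hashing flow identifiers, so several large
# DML flows may collide on the same uplink. A contention-aware assignment
# spreads the same flows evenly across uplinks.

NUM_UPLINKS = 4
flows = [f"flow-{i}" for i in range(8)]  # hypothetical all-reduce flows


def ecmp_assign(flows, num_uplinks, seed=0):
    """Hash-based (ECMP-like) uplink selection; collisions are possible."""
    rng = random.Random(seed)
    salt = rng.randint(0, 1 << 30)
    return {f: hash((f, salt)) % num_uplinks for f in flows}


def balanced_assign(flows, num_uplinks):
    """Round-robin assignment: no uplink carries more than ceil(F/U) flows."""
    return {f: i % num_uplinks for i, f in enumerate(flows)}


def max_load(assignment, num_uplinks):
    """Largest number of flows mapped to any single uplink."""
    loads = [0] * num_uplinks
    for uplink in assignment.values():
        loads[uplink] += 1
    return max(loads)


if __name__ == "__main__":
    ecmp = ecmp_assign(flows, NUM_UPLINKS)
    balanced = balanced_assign(flows, NUM_UPLINKS)
    # With 8 flows over 4 uplinks, the balanced scheme keeps every uplink at
    # 2 flows, while hashing may place 3 or more flows on one uplink,
    # i.e., the contention the abstract attributes to hash collisions.
    print("ECMP max flows on one uplink:    ", max_load(ecmp, NUM_UPLINKS))
    print("Balanced max flows on one uplink:", max_load(balanced, NUM_UPLINKS))
```

Running the sketch typically shows the hash-based mapping concentrating flows on a subset of uplinks while the balanced mapping caps every uplink at two flows, which is the intuition behind exploiting known training traffic patterns rather than relying on hashing.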
Journal introduction:
Computer Networks is an international, archival journal providing complete coverage of all topics of interest to those involved in computer communications networking. The audience includes researchers, managers, and operators of networks, as well as designers and implementers. The Editorial Board will consider any material for publication that is of interest to these groups.