vClos: Network contention aware scheduling for distributed machine learning tasks in multi-tenant GPU clusters

Xinchi Han, Shizhen Zhao, Yongxi Lv, Peirui Cao, Weihao Jiang, Qinwei Yang, Yunzhuo Liu, Shengkai Lin, Bo Jiang, Ximeng Liu, Yong Cui, Chenghu Zhou, Xinbing Wang

Computer Networks, Volume 268, Article 111285. Published 2025-05-24. DOI: 10.1016/j.comnet.2025.111285
Citations: 0
Abstract
Distributed machine learning (DML) technology enables training large neural networks in reasonable time. However, with computing power growing faster than network capacity, network communication has become the bottleneck of DML training. Current multi-tenant GPU clusters suffer network contention caused by hash collisions in ECMP routing, which not only increases communication overhead but also lengthens users' waiting times. This paper analyzes how network contention slows training throughput and characterizes DML training traffic as a P-incast-free pattern. We propose Balanced Routing, which leverages this traffic pattern to reduce contention. Furthermore, we introduce vClos, which handles contention by jointly considering topology, routing, communication pattern, and GPU assignment. Evaluations via testbed experiments and real-trace-based simulations show that vClos reduces average job completion time by up to 67.6% compared to ECMP under heavy workloads.
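To make the contention mechanism concrete, here is a minimal illustrative Python sketch (not the paper's algorithm; the flow tuples, uplink count, and round-robin placement are assumptions for illustration). It contrasts ECMP-style hash-based uplink selection, where independent flows can collide on the same uplink and split its bandwidth, with a balanced assignment that is collision-free when each host sources at most one large flow at a time, as the abstract's P-incast-free pattern suggests.

```python
import hashlib
from collections import Counter

NUM_UPLINKS = 8

def ecmp_uplink(flow, num_uplinks):
    """ECMP-style choice: hash the flow tuple and pick an uplink.
    Flows that hash to the same uplink contend for its bandwidth."""
    digest = hashlib.md5(repr(flow).encode()).hexdigest()
    return int(digest, 16) % num_uplinks

def balanced_uplink(flow_index, num_uplinks):
    """Contention-aware alternative: place flows deterministically
    (round-robin here), which is collision-free when each host sources
    at most one large flow at a time (a P-incast-free-like pattern)."""
    return flow_index % num_uplinks

# Hypothetical flows: one large training flow per source host.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 4791, 50000 + i) for i in range(NUM_UPLINKS)]

ecmp_load = Counter(ecmp_uplink(f, NUM_UPLINKS) for f in flows)
balanced_load = Counter(balanced_uplink(i, NUM_UPLINKS) for i in range(len(flows)))

# With 8 flows over 8 uplinks, hashing typically lands 2+ flows on some
# uplink (a hash collision), halving those flows' throughput, while the
# balanced assignment yields exactly one flow per uplink.
print("ECMP max flows on one uplink:     ", max(ecmp_load.values()))
print("Balanced max flows on one uplink: ", max(balanced_load.values()))
```

Running the sketch with these assumed flows will usually print an ECMP maximum of 2 or more versus exactly 1 for the balanced assignment, which is the throughput gap the paper's Balanced Routing and vClos designs aim to close.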
About the journal
Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.