vClos: Network contention aware scheduling for distributed machine learning tasks in multi-tenant GPU clusters

Impact Factor: 4.6 · CAS Tier 2 (Computer Science) · JCR Q1 (Computer Science, Hardware & Architecture)
Xinchi Han, Shizhen Zhao, Yongxi Lv, Peirui Cao, Weihao Jiang, Qinwei Yang, Yunzhuo Liu, Shengkai Lin, Bo Jiang, Ximeng Liu, Yong Cui, Chenghu Zhou, Xinbing Wang
{"title":"vClos:多租户GPU集群中分布式机器学习任务的网络竞争感知调度","authors":"Xinchi Han ,&nbsp;Shizhen Zhao ,&nbsp;Yongxi Lv ,&nbsp;Peirui Cao ,&nbsp;Weihao Jiang ,&nbsp;Qinwei Yang ,&nbsp;Yunzhuo Liu ,&nbsp;Shengkai Lin ,&nbsp;Bo Jiang ,&nbsp;Ximeng Liu ,&nbsp;Yong Cui ,&nbsp;Chenghu Zhou ,&nbsp;Xinbing Wang","doi":"10.1016/j.comnet.2025.111285","DOIUrl":null,"url":null,"abstract":"<div><div>Distributed machine learning (DML) technology enables training large neural networks in reasonable time. However, with computing power growing faster than network capacity, network communication becomes the DML training bottleneck. Current multi-tenant GPU clusters suffer network contention due to hash-collision, which not only increases the overhead of communication but also will increase the waiting time of users. This paper analyzes how network contention slows training throughput and summarizes training traffic patterns as a <span><math><mi>P</mi></math></span>-incast-free Pattern. We propose a <em>Balanced Routing</em> which leverages training traffic patterns to reduce contention. Furthermore, we introduce <em>vClos</em> to handle contention through jointly considering topology, routing, communication pattern, and GPU assignment. Evaluations via testbed experiments and real-trace-based simulations show <em>vClos</em> reduces average job completion time up to 67.6% compared to ECMP in heavy workloads.</div></div>","PeriodicalId":50637,"journal":{"name":"Computer Networks","volume":"268 ","pages":"Article 111285"},"PeriodicalIF":4.6000,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"vClos: Network contention aware scheduling for distributed machine learning tasks in multi-tenant GPU clusters\",\"authors\":\"Xinchi Han ,&nbsp;Shizhen Zhao ,&nbsp;Yongxi Lv ,&nbsp;Peirui Cao ,&nbsp;Weihao Jiang ,&nbsp;Qinwei Yang ,&nbsp;Yunzhuo Liu ,&nbsp;Shengkai Lin ,&nbsp;Bo Jiang ,&nbsp;Ximeng Liu ,&nbsp;Yong Cui ,&nbsp;Chenghu Zhou ,&nbsp;Xinbing Wang\",\"doi\":\"10.1016/j.comnet.2025.111285\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Distributed machine learning (DML) technology enables training large neural networks in reasonable time. However, with computing power growing faster than network capacity, network communication becomes the DML training bottleneck. Current multi-tenant GPU clusters suffer network contention due to hash-collision, which not only increases the overhead of communication but also will increase the waiting time of users. This paper analyzes how network contention slows training throughput and summarizes training traffic patterns as a <span><math><mi>P</mi></math></span>-incast-free Pattern. We propose a <em>Balanced Routing</em> which leverages training traffic patterns to reduce contention. Furthermore, we introduce <em>vClos</em> to handle contention through jointly considering topology, routing, communication pattern, and GPU assignment. 
Evaluations via testbed experiments and real-trace-based simulations show <em>vClos</em> reduces average job completion time up to 67.6% compared to ECMP in heavy workloads.</div></div>\",\"PeriodicalId\":50637,\"journal\":{\"name\":\"Computer Networks\",\"volume\":\"268 \",\"pages\":\"Article 111285\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2025-05-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1389128625002531\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1389128625002531","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Distributed machine learning (DML) technology enables training large neural networks in a reasonable amount of time. However, with computing power growing faster than network capacity, network communication has become the bottleneck of DML training. Current multi-tenant GPU clusters suffer from network contention caused by hash collisions, which not only increases communication overhead but also lengthens users' waiting time. This paper analyzes how network contention slows training throughput and summarizes DML training traffic patterns as a P-incast-free Pattern. We propose Balanced Routing, which leverages these traffic patterns to reduce contention. Furthermore, we introduce vClos, which handles contention by jointly considering topology, routing, communication pattern, and GPU assignment. Evaluations via testbed experiments and real-trace-based simulations show that vClos reduces average job completion time by up to 67.6% compared with ECMP under heavy workloads.
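To make the hash-collision contention concrete, the sketch below (our illustration, not code from the paper) contrasts ECMP-style per-flow hashing with the kind of balanced, contention-free flow placement that Balanced Routing aims for. The flow 5-tuples, port numbers, and uplink count are hypothetical.

```python
# A minimal sketch, assuming 8 DML flows crossing between two leaf switches
# over 8 uplinks. Not the paper's algorithm; it only illustrates why hash
# collisions cause contention and why a balanced placement avoids it.
import hashlib
from collections import Counter

def ecmp_assign(flows, num_uplinks):
    # ECMP hashes each flow's 5-tuple independently, so two flows can
    # collide on one uplink and halve each other's bandwidth.
    return [int(hashlib.md5(repr(f).encode()).hexdigest(), 16) % num_uplinks
            for f in flows]

def balanced_assign(flows, num_uplinks):
    # A contention-aware scheduler that knows all flows up front can spread
    # them round-robin: at most ceil(n/k) flows ever share an uplink.
    return [i % num_uplinks for i, _ in enumerate(flows)]

def max_link_load(assignment, num_uplinks):
    # Worst-case number of flows sharing a single uplink.
    counts = Counter(assignment)
    return max(counts.get(link, 0) for link in range(num_uplinks))

if __name__ == "__main__":
    # Hypothetical 5-tuples for 8 allreduce flows (src, dst, sport, dport, proto).
    flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 50000 + i, 4791, "udp")
             for i in range(8)]
    k = 8
    print("ECMP max flows on one uplink:    ", max_link_load(ecmp_assign(flows, k), k))
    print("Balanced max flows on one uplink:", max_link_load(balanced_assign(flows, k), k))
```

With 8 flows hashed onto 8 links, the chance that they all land on distinct links is only 8!/8^8 (about 0.24%), so the ECMP line almost always reports a load of 2 or more, while the balanced placement always reports 1. vClos goes further than this sketch by co-designing topology, routing, communication pattern, and GPU assignment.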
Source journal: Computer Networks (Engineering & Technology - Telecommunications)
CiteScore: 10.80
Self-citation rate: 3.60%
Annual publications: 434
Average review time: 8.6 months

Journal description: Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers, and operators of networks, as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.