Dynamic Flow Scheduling for DNN Training Workloads in Data Centers

Impact Factor 4.7; Zone 2, Computer Science; JCR Q1, Computer Science, Information Systems

IEEE Transactions on Network and Service Management Pub Date : 2024-08-27 DOI:10.1109/TNSM.2024.3450670

Xiaoyang Zhao;Chuan Wu;Xia Zhu

Volume 21, Issue 6, Pages 6643-6657. Full text: https://ieeexplore.ieee.org/document/10649009/


Abstract

Distributed deep learning (DL) training constitutes a significant portion of workloads in modern data centers that are equipped with high computational capacities, such as GPU servers. However, frequent tensor exchanges among workers during distributed deep neural network (DNN) training can result in heavy traffic in the data center network, leading to congestion at server NICs and in the switching network. Unfortunately, none of the existing DL communication libraries supports active flow control to optimize tensor transmission performance; they instead rely on passive adjustments to the congestion window or sending rate based on packet loss or delay. To address this issue, we propose a per-host flow scheduler that dynamically tunes the sending rates of outgoing tensor flows from each server, maximizing network bandwidth utilization and expediting job training progress. Our scheduler comprises two main components: a monitoring module that interacts with state-of-the-art communication libraries supporting the parameter server and all-reduce paradigms to track the training progress of DNN jobs, and a congestion control protocol that receives in-network feedback from the switches a flow traverses and computes optimized flow sending rates. For data centers where switches are not programmable, we provide a software solution that emulates switch behavior and interacts with the scheduler on servers. Experiments on a real-world GPU testbed and with trace-driven simulation demonstrate that our scheduler outperforms common rate control protocols and representative learning-based schemes in various settings.
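The abstract does not state how sending rates are computed from switch feedback. As a purely illustrative sketch, and not the authors' protocol, the following Python code shows one classical way a rate allocator could assign rates to flows that share NICs and switch links: max-min fair allocation by progressive filling. The function name max_min_rates, the link names, and the capacities are all hypothetical, and the code assumes every flow crosses at least one link and that every link on a path has a known capacity.

```python
from typing import Dict, List


def max_min_rates(link_capacity: Dict[str, float],
                  flow_paths: Dict[str, List[str]]) -> Dict[str, float]:
    """Max-min fair rates via progressive filling (illustrative only).

    Assumes each flow crosses at least one link and that every link
    named in a path appears in link_capacity.
    """
    remaining = dict(link_capacity)                     # unallocated capacity per link
    active = {f: set(p) for f, p in flow_paths.items()}  # flows without a rate yet
    rates: Dict[str, float] = {}

    while active:
        # Equal share each link could still offer to its unfrozen flows.
        share = {}
        for link, cap in remaining.items():
            n = sum(1 for path in active.values() if link in path)
            if n:
                share[link] = cap / n

        # The link with the smallest share is the current bottleneck.
        bottleneck = min(share, key=share.get)
        rate = share[bottleneck]

        # Freeze every flow crossing the bottleneck at that rate and
        # subtract its rate from every link on its path.
        for flow in [f for f, path in active.items() if bottleneck in path]:
            rates[flow] = rate
            for link in active[flow]:
                remaining[link] -= rate
            del active[flow]

    return rates


# Hypothetical topology: two server NICs feeding one core link (Gbps).
capacity = {"nic_a": 10.0, "nic_b": 10.0, "core": 12.0}
paths = {"f1": ["nic_a", "core"], "f2": ["nic_b", "core"], "f3": ["nic_b"]}
print(max_min_rates(capacity, paths))
```

In this hypothetical topology, f2 and f3 are limited to 5 Gbps each by the shared NIC, and f1 then takes the 7 Gbps left on the core link. A scheduler of the kind the abstract describes would additionally weigh training progress reported by the monitoring module, which this sketch does not model.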


Source journal

IEEE Transactions on Network and Service Management (Computer Science: Computer Networks and Communications)

CiteScore: 9.30

Self-citation rate: 15.10%

Articles published: 325

Journal description: IEEE Transactions on Network and Service Management will publish (online only) peer-reviewed, archival-quality papers that advance the state of the art and practical applications of network and service management. Theoretical research contributions (presenting new concepts and techniques) and applied contributions (reporting on experiences and experiments with actual systems) will be encouraged. These transactions will focus on the key technical issues related to: Management Models, Architectures and Frameworks; Service Provisioning, Reliability and Quality Assurance; Management Functions; Enabling Technologies; Information and Communication Models; Policies; Applications and Case Studies; Emerging Technologies and Standards.