分布式深度神经网络训练中调度解耦全约简原语的高效通信框架

IF 5.4 2区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Emerging Topics in Computing Pub Date : 2025-06-02 DOI:10.1109/TETC.2025.3573522

Yunqi Gao;Bing Hu;Mahdi Boloursaz Mashhadi;Wei Wang;Rahim Tafazolli;Mérouane Debbah

{"title":"分布式深度神经网络训练中调度解耦全约简原语的高效通信框架","authors":"Yunqi Gao;Bing Hu;Mahdi Boloursaz Mashhadi;Wei Wang;Rahim Tafazolli;Mérouane Debbah","doi":"10.1109/TETC.2025.3573522","DOIUrl":null,"url":null,"abstract":"Communication scheduling effectively improves the scalability of distributed deep learning by overlapping computation and communication tasks during training. However, existing communication scheduling frameworks based on tensor partitioning suffer from two fundamental issues: (1) partitioning schemes at the data volume level introduce extensive startup overheads leading to higher energy consumption, and (2) partitioning schemes at the communication primitive level do not provide optimal scheduling resulting in longer training time. In this article, we propose an efficient communication mechanism, namely PipeDAP, which schedules decoupled all-reduce operations in a near-optimal order to minimize the time and energy consumption of training DNN models. We build the mathematical model for PipeDAP and derive the near-optimal scheduling order of the reduce-scatter and all-gather operations. Meanwhile, we leverage simultaneous communication of reduce-scatter and all-gather operations to further reduce the startup overheads. We implement the PipeDAP architecture on PyTorch framework, and apply it for distributed training of benchmark DNN models. Experimental results on two GPU clusters demonstrate that PipeDAP achieves up to 1.82x speedup and saves up to 45.4% of energy consumption compared to the state-of-the-art communication scheduling frameworks.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"13 3","pages":"1170-1184"},"PeriodicalIF":5.4000,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"PipeDAP: An Efficient Communication Framework for Scheduling Decoupled All-Reduce Primitives in Distributed DNN Training\",\"authors\":\"Yunqi Gao;Bing Hu;Mahdi Boloursaz Mashhadi;Wei Wang;Rahim Tafazolli;Mérouane Debbah\",\"doi\":\"10.1109/TETC.2025.3573522\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Communication scheduling effectively improves the scalability of distributed deep learning by overlapping computation and communication tasks during training. However, existing communication scheduling frameworks based on tensor partitioning suffer from two fundamental issues: (1) partitioning schemes at the data volume level introduce extensive startup overheads leading to higher energy consumption, and (2) partitioning schemes at the communication primitive level do not provide optimal scheduling resulting in longer training time. In this article, we propose an efficient communication mechanism, namely PipeDAP, which schedules decoupled all-reduce operations in a near-optimal order to minimize the time and energy consumption of training DNN models. We build the mathematical model for PipeDAP and derive the near-optimal scheduling order of the reduce-scatter and all-gather operations. Meanwhile, we leverage simultaneous communication of reduce-scatter and all-gather operations to further reduce the startup overheads. We implement the PipeDAP architecture on PyTorch framework, and apply it for distributed training of benchmark DNN models. Experimental results on two GPU clusters demonstrate that PipeDAP achieves up to 1.82x speedup and saves up to 45.4% of energy consumption compared to the state-of-the-art communication scheduling frameworks.\",\"PeriodicalId\":13156,\"journal\":{\"name\":\"IEEE Transactions on Emerging Topics in Computing\",\"volume\":\"13 3\",\"pages\":\"1170-1184\"},\"PeriodicalIF\":5.4000,\"publicationDate\":\"2025-06-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Emerging Topics in Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11021340/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11021340/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

通信调度通过在训练过程中重叠计算和通信任务，有效地提高了分布式深度学习的可扩展性。然而，现有的基于张量分区的通信调度框架存在两个基本问题：(1)数据量级别的分区方案引入了大量的启动开销，导致较高的能耗；(2)通信原语级别的分区方案不能提供最优调度，导致训练时间较长。在本文中，我们提出了一种高效的通信机制，即PipeDAP，它以近乎最优的顺序调度解耦的全约运算，以最大限度地减少训练DNN模型的时间和能量消耗。建立了PipeDAP的数学模型，导出了减少分散和全聚操作的近最优调度顺序。同时，我们利用reduce-scatter和all-gather操作的同步通信，进一步降低启动开销。我们在PyTorch框架上实现了PipeDAP架构，并将其应用于基准DNN模型的分布式训练。在两个GPU集群上的实验结果表明，与目前最先进的通信调度框架相比，PipeDAP实现了高达1.82倍的加速，节省了45.4%的能耗。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

PipeDAP: An Efficient Communication Framework for Scheduling Decoupled All-Reduce Primitives in Distributed DNN Training

Communication scheduling effectively improves the scalability of distributed deep learning by overlapping computation and communication tasks during training. However, existing communication scheduling frameworks based on tensor partitioning suffer from two fundamental issues: (1) partitioning schemes at the data volume level introduce extensive startup overheads leading to higher energy consumption, and (2) partitioning schemes at the communication primitive level do not provide optimal scheduling resulting in longer training time. In this article, we propose an efficient communication mechanism, namely PipeDAP, which schedules decoupled all-reduce operations in a near-optimal order to minimize the time and energy consumption of training DNN models. We build the mathematical model for PipeDAP and derive the near-optimal scheduling order of the reduce-scatter and all-gather operations. Meanwhile, we leverage simultaneous communication of reduce-scatter and all-gather operations to further reduce the startup overheads. We implement the PipeDAP architecture on PyTorch framework, and apply it for distributed training of benchmark DNN models. Experimental results on two GPU clusters demonstrate that PipeDAP achieves up to 1.82x speedup and saves up to 45.4% of energy consumption compared to the state-of-the-art communication scheduling frameworks.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Emerging Topics in Computing Computer Science-Computer Science (miscellaneous)

CiteScore

12.10

自引率

5.10%

发文量

113

期刊介绍： IEEE Transactions on Emerging Topics in Computing publishes papers on emerging aspects of computer science, computing technology, and computing applications not currently covered by other IEEE Computer Society Transactions. Some examples of emerging topics in computing include: IT for Green, Synthetic and organic computing structures and systems, Advanced analytics, Social/occupational computing, Location-based/client computer systems, Morphic computer design, Electronic game systems, & Health-care IT.