DC2: Delay-aware Compression Control for Distributed Machine Learning

A. Abdelmoniem, M. Canini
{"title":"分布式机器学习的延迟感知压缩控制","authors":"A. Abdelmoniem, M. Canini","doi":"10.1109/INFOCOM42981.2021.9488810","DOIUrl":null,"url":null,"abstract":"Distributed training performs data-parallel training of DNN models which is a necessity for increasingly complex models and large datasets. Recent works are identifying major communication bottlenecks in distributed training. These works seek possible opportunities to speed-up the training in systems supporting distributed ML workloads. As communication reduction, compression techniques are proposed to speed up this communication phase. However, compression comes at the cost of reduced model accuracy, especially when compression is applied arbitrarily. Instead, we advocate a more controlled use of compression and propose DC2, a delay-aware compression control mechanism. DC2 couples compression control and network delays in applying compression adaptively. DC2 not only compensates for network variations but can also strike a better trade-off between training speed and accuracy. DC2 is implemented as a drop-in module to the communication library used by the ML toolkit and can operate in a variety of network settings. We empirically evaluate DC2 in network environments exhibiting low and high delay variations. Our evaluation of different popular CNN models and datasets shows that DC2 improves training speed-ups of up to 41× and 5.3 × over baselines with no-compression and uniform compression, respectively.","PeriodicalId":293079,"journal":{"name":"IEEE INFOCOM 2021 - IEEE Conference on Computer Communications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":"{\"title\":\"DC2: Delay-aware Compression Control for Distributed Machine Learning\",\"authors\":\"A. Abdelmoniem, M. Canini\",\"doi\":\"10.1109/INFOCOM42981.2021.9488810\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Distributed training performs data-parallel training of DNN models which is a necessity for increasingly complex models and large datasets. Recent works are identifying major communication bottlenecks in distributed training. These works seek possible opportunities to speed-up the training in systems supporting distributed ML workloads. As communication reduction, compression techniques are proposed to speed up this communication phase. However, compression comes at the cost of reduced model accuracy, especially when compression is applied arbitrarily. Instead, we advocate a more controlled use of compression and propose DC2, a delay-aware compression control mechanism. DC2 couples compression control and network delays in applying compression adaptively. DC2 not only compensates for network variations but can also strike a better trade-off between training speed and accuracy. DC2 is implemented as a drop-in module to the communication library used by the ML toolkit and can operate in a variety of network settings. We empirically evaluate DC2 in network environments exhibiting low and high delay variations. 
Our evaluation of different popular CNN models and datasets shows that DC2 improves training speed-ups of up to 41× and 5.3 × over baselines with no-compression and uniform compression, respectively.\",\"PeriodicalId\":293079,\"journal\":{\"name\":\"IEEE INFOCOM 2021 - IEEE Conference on Computer Communications\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-05-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"23\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE INFOCOM 2021 - IEEE Conference on Computer Communications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/INFOCOM42981.2021.9488810\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE INFOCOM 2021 - IEEE Conference on Computer Communications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/INFOCOM42981.2021.9488810","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 23

Abstract

Distributed training performs data-parallel training of DNN models, which is a necessity for increasingly complex models and large datasets. Recent works identify major communication bottlenecks in distributed training and seek opportunities to speed up training in systems supporting distributed ML workloads. To reduce communication, compression techniques have been proposed to speed up this communication phase. However, compression comes at the cost of reduced model accuracy, especially when compression is applied arbitrarily. Instead, we advocate a more controlled use of compression and propose DC2, a delay-aware compression control mechanism. DC2 couples compression control with network delays to apply compression adaptively. DC2 not only compensates for network variations but can also strike a better trade-off between training speed and accuracy. DC2 is implemented as a drop-in module for the communication library used by the ML toolkit and can operate in a variety of network settings. We empirically evaluate DC2 in network environments exhibiting low and high delay variations. Our evaluation of different popular CNN models and datasets shows that DC2 achieves training speed-ups of up to 41× and 5.3× over baselines with no compression and uniform compression, respectively.
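The abstract does not spell out DC2's control algorithm, so the sketch below is only an illustration of the general idea under stated assumptions: a hypothetical `DelayAwareController` that adapts a generic top-k gradient-sparsification ratio to an exponentially smoothed estimate of per-step communication delay. The class name, parameters, and the top-k compressor are illustrative assumptions, not the paper's implementation.

```python
import math
import torch


def topk_compress(grad: torch.Tensor, ratio: float):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries
    (a generic top-k sparsification compressor, not DC2's exact scheme)."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape


def topk_decompress(values: torch.Tensor, indices: torch.Tensor, shape) -> torch.Tensor:
    """Rebuild a dense gradient tensor from the retained entries."""
    flat = torch.zeros(math.prod(shape), dtype=values.dtype, device=values.device)
    flat[indices] = values
    return flat.view(shape)


class DelayAwareController:
    """Hypothetical delay-aware controller (illustrative only): compress more
    aggressively when observed communication delay rises above a smoothed
    baseline, and relax compression when the network is fast."""

    def __init__(self, base_ratio=0.1, min_ratio=0.001, max_ratio=1.0, alpha=0.9):
        self.ratio = base_ratio      # fraction of gradient entries to transmit
        self.min_ratio = min_ratio
        self.max_ratio = max_ratio
        self.alpha = alpha           # EWMA smoothing factor for the delay baseline
        self.avg_delay = None

    def update(self, observed_delay: float) -> float:
        """Adjust the compression ratio from the latest per-step delay sample."""
        if self.avg_delay is None:
            self.avg_delay = observed_delay
        self.avg_delay = self.alpha * self.avg_delay + (1 - self.alpha) * observed_delay
        # Delay above the baseline -> shrink the ratio (stronger compression);
        # delay below the baseline -> grow the ratio (better accuracy).
        scale = self.avg_delay / max(observed_delay, 1e-9)
        self.ratio = min(self.max_ratio, max(self.min_ratio, self.ratio * scale))
        return self.ratio


if __name__ == "__main__":
    controller = DelayAwareController()
    grad = torch.randn(4, 256)
    for delay in [0.01, 0.05, 0.20, 0.03]:   # simulated per-step delays (seconds)
        ratio = controller.update(delay)
        values, indices, shape = topk_compress(grad, ratio)
        restored = topk_decompress(values, indices, shape)
        print(f"delay={delay:.2f}s  ratio={ratio:.4f}  sent={values.numel()} entries")
```

In this toy loop, a sudden delay spike drives the ratio down (fewer entries sent per step), while a return to fast steps lets it recover, mirroring the abstract's claim that compression should track network conditions rather than be applied uniformly.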