Doing more by doing less: how structured partial backpropagation improves deep learning clusters

Adarsh Kumar, Kausik Subramanian, S. Venkataraman, Aditya Akella
{"title":"Doing more by doing less: how structured partial backpropagation improves deep learning clusters","authors":"Adarsh Kumar, Kausik Subramanian, S. Venkataraman, Aditya Akella","doi":"10.1145/3488659.3493778","DOIUrl":null,"url":null,"abstract":"Many organizations employ compute clusters equipped with accelerators such as GPUs and TPUs for training deep learning models in a distributed fashion. Training is resource-intensive, consuming significant compute, memory, and network resources. Many prior works explore how to reduce training resource footprint without impacting quality, but their focus on a subset of the bottlenecks (typically only the network) limits their ability to improve overall cluster utilization. In this work, we exploit the unique characteristics of deep learning workloads to propose Structured Partial Backpropagation(SPB), a technique that systematically controls the amount of backpropagation at individual workers in distributed training. This simultaneously reduces network bandwidth, compute utilization, and memory footprint while preserving model quality. To efficiently leverage the benefits of SPB at cluster level, we introduce Jigsaw, a SPB aware scheduler, which does scheduling at the iteration level for Deep Learning Training(DLT) jobs. We find that Jigsaw can improve large scale cluster efficiency by as high as 28%.","PeriodicalId":343000,"journal":{"name":"Proceedings of the 2nd ACM International Workshop on Distributed Machine Learning","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2nd ACM International Workshop on Distributed Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3488659.3493778","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Many organizations employ compute clusters equipped with accelerators such as GPUs and TPUs to train deep learning models in a distributed fashion. Training is resource-intensive, consuming significant compute, memory, and network resources. Many prior works explore how to reduce the training resource footprint without impacting quality, but their focus on a subset of the bottlenecks (typically only the network) limits their ability to improve overall cluster utilization. In this work, we exploit the unique characteristics of deep learning workloads to propose Structured Partial Backpropagation (SPB), a technique that systematically controls the amount of backpropagation performed at individual workers in distributed training. SPB simultaneously reduces network bandwidth, compute utilization, and memory footprint while preserving model quality. To efficiently leverage the benefits of SPB at the cluster level, we introduce Jigsaw, an SPB-aware scheduler that schedules deep learning training (DLT) jobs at the iteration level. We find that Jigsaw can improve large-scale cluster efficiency by up to 28%.
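The abstract does not spell out how SPB limits backpropagation at a worker, but the general idea of truncating the backward pass can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical illustration, not the paper's actual algorithm: the model shape, the set_backprop_depth helper, and the cutoff value are all assumptions. Freezing every parameter below a cutoff depth makes autograd stop the backward pass there, so gradients for the lower layers are neither computed nor stored, and a data-parallel framework would have no gradients to synchronize for them.

    # Hypothetical sketch of truncated backpropagation on one worker.
    # Model, helper, and cutoff are illustrative assumptions, not SPB itself.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(784, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 10),
    )

    def set_backprop_depth(model: nn.Sequential, cutoff: int) -> None:
        """Freeze all layers below `cutoff` so autograd stops the backward
        pass early: frozen layers get no gradient computation, no stored
        gradients, and nothing to exchange during gradient synchronization."""
        for idx, layer in enumerate(model):
            for p in layer.parameters():
                p.requires_grad_(idx >= cutoff)

    # This worker only backpropagates through the top Linear layer (idx 4).
    set_backprop_depth(model, cutoff=4)

    x = torch.randn(32, 784)
    loss = nn.functional.cross_entropy(model(x), torch.randint(0, 10, (32,)))
    loss.backward()

    # Gradients exist only above the cutoff: layer 4 has .grad, layer 0 does not.
    print(model[4].weight.grad is not None)  # True
    print(model[0].weight.grad is not None)  # False

This shape of technique is why the abstract can claim simultaneous savings: the skipped layers contribute no backward-pass FLOPs, no gradient memory, and no gradient traffic on the network.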