Bamboo:为大型dnn的可负担训练使可抢占实例具有弹性

Symposium on Networked Systems Design and Implementation Pub Date : 2022-04-26 DOI:10.48550/arXiv.2204.12013

John Thorpe, Pengzhan Zhao, Jon Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, R. Netravali, Guoqing Harry Xu

{"title":"Bamboo:为大型dnn的可负担训练使可抢占实例具有弹性","authors":"John Thorpe, Pengzhan Zhao, Jon Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, R. Netravali, Guoqing Harry Xu","doi":"10.48550/arXiv.2204.12013","DOIUrl":null,"url":null,"abstract":"DNN models across many domains continue to grow in size, resulting in high resource requirements for effective training, and unpalatable (and often unaffordable) costs for organizations and research labs across scales. This paper aims to significantly reduce training costs with effective use of preemptible instances, i.e., those that can be obtained at a much cheaper price while idle, but may be preempted whenever requested by priority users. Doing so, however, requires new forms of resiliency and efficiency to cope with the possibility of frequent preemptions - a failure model that is drastically different from the occasional failures in normal cluster settings that existing checkpointing techniques target. We present Bamboo, a distributed system that tackles these challenges by introducing redundant computations into the training pipeline, i.e., whereby one node performs computations over not only its own layers but also over some layers in its neighbor. Our key insight is that training large models often requires pipeline parallelism where\"pipeline bubbles\"naturally exist. Bamboo carefully fills redundant computations into these bubbles, providing resilience at a low cost. Across a variety of widely used DNN models, Bamboo outperforms traditional checkpointing by 3.7x in training throughput, and reduces costs by 2.4x compared to a setting where on-demand instances are used.","PeriodicalId":365816,"journal":{"name":"Symposium on Networked Systems Design and Implementation","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs\",\"authors\":\"John Thorpe, Pengzhan Zhao, Jon Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, R. Netravali, Guoqing Harry Xu\",\"doi\":\"10.48550/arXiv.2204.12013\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"DNN models across many domains continue to grow in size, resulting in high resource requirements for effective training, and unpalatable (and often unaffordable) costs for organizations and research labs across scales. This paper aims to significantly reduce training costs with effective use of preemptible instances, i.e., those that can be obtained at a much cheaper price while idle, but may be preempted whenever requested by priority users. Doing so, however, requires new forms of resiliency and efficiency to cope with the possibility of frequent preemptions - a failure model that is drastically different from the occasional failures in normal cluster settings that existing checkpointing techniques target. We present Bamboo, a distributed system that tackles these challenges by introducing redundant computations into the training pipeline, i.e., whereby one node performs computations over not only its own layers but also over some layers in its neighbor. Our key insight is that training large models often requires pipeline parallelism where\\\"pipeline bubbles\\\"naturally exist. Bamboo carefully fills redundant computations into these bubbles, providing resilience at a low cost. Across a variety of widely used DNN models, Bamboo outperforms traditional checkpointing by 3.7x in training throughput, and reduces costs by 2.4x compared to a setting where on-demand instances are used.\",\"PeriodicalId\":365816,\"journal\":{\"name\":\"Symposium on Networked Systems Design and Implementation\",\"volume\":\"17 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-04-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Symposium on Networked Systems Design and Implementation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2204.12013\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Symposium on Networked Systems Design and Implementation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2204.12013","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 10

摘要

跨许多领域的DNN模型的规模继续增长，导致对有效培训的高资源需求，以及跨规模的组织和研究实验室难以接受的(通常是负担不起的)成本。本文的目的是通过有效地利用可抢占实例来显著降低培训成本，即那些在空闲时可以以更低的价格获得，但在优先级用户请求时可以被抢占的实例。然而，这样做需要新形式的弹性和效率来应对频繁抢占的可能性——这种故障模型与现有检查点技术所针对的正常集群设置中的偶尔故障截然不同。我们介绍了Bamboo，这是一个分布式系统，它通过在训练管道中引入冗余计算来解决这些挑战，即一个节点不仅在自己的层上执行计算，还在邻近的一些层上执行计算。我们的关键见解是，训练大型模型通常需要管道并行，而“管道气泡”自然存在。Bamboo小心地将冗余计算填充到这些气泡中，以低成本提供弹性。在各种广泛使用的DNN模型中，Bamboo的训练吞吐量比传统的检查点高3.7倍，与使用按需实例的设置相比，成本降低了2.4倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs

DNN models across many domains continue to grow in size, resulting in high resource requirements for effective training, and unpalatable (and often unaffordable) costs for organizations and research labs across scales. This paper aims to significantly reduce training costs with effective use of preemptible instances, i.e., those that can be obtained at a much cheaper price while idle, but may be preempted whenever requested by priority users. Doing so, however, requires new forms of resiliency and efficiency to cope with the possibility of frequent preemptions - a failure model that is drastically different from the occasional failures in normal cluster settings that existing checkpointing techniques target. We present Bamboo, a distributed system that tackles these challenges by introducing redundant computations into the training pipeline, i.e., whereby one node performs computations over not only its own layers but also over some layers in its neighbor. Our key insight is that training large models often requires pipeline parallelism where"pipeline bubbles"naturally exist. Bamboo carefully fills redundant computations into these bubbles, providing resilience at a low cost. Across a variety of widely used DNN models, Bamboo outperforms traditional checkpointing by 3.7x in training throughput, and reduces costs by 2.4x compared to a setting where on-demand instances are used.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Symposium on Networked Systems Design and Implementation

自引率

0.00%

发文量