Accelerating container-based deep learning hyperparameter optimization workloads

Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning. Pub Date: 2022-06-12. DOI: 10.1145/3533028.3533309

Rui Liu, David Wong, David J. Lange, Patrik Larsson, Vinay Jethava, Qing Zheng


Citations: 2

Abstract

DocuSign is rapidly expanding its use of artificial intelligence and is developing and deploying a growing number of deep learning models. During development, developers typically build several deep learning models and train each of them under many candidate hyperparameter configurations to find the best-performing one, a process called hyperparameter optimization (HPO). Such HPO jobs can run for a long time because models keep growing and the number of hyperparameter configurations is large. Furthermore, HPO jobs at DocuSign run in container-based environments so that the best-performing model can be deployed and maintained in production reliably and efficiently. The resulting workload of long-running, containerized HPO jobs can rapidly saturate DocuSign's current machine learning infrastructure, yet the key resources (e.g., GPU memory or compute units) are not always fully utilized. For example, some hyperparameter configurations need only a fraction of the GPU memory but occupy the entire device because of containerization. As a result, users may have to either wait or manually coordinate with others for resources to run their jobs, and such HPO workloads often take unexpectedly long to complete. To address this problem, we propose Relish, a system designed specifically to accelerate HPO workloads by segmenting HPO jobs and efficiently sharing GPU resources in container-based environments, so that multiple containerized, segmented jobs can execute in parallel. We evaluate Relish on an HPO workload based on a three-month-long trace from a multi-tenant GPU cluster used by a research and development team at DocuSign. The results demonstrate that Relish can significantly improve GPU utilization and accelerate the workload by efficiently executing multiple jobs concurrently.
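The underutilization described in the abstract can be illustrated with a small sketch. It is not Relish's scheduler, which the abstract does not specify; it is a generic first-fit-decreasing packing by GPU memory, and the job names, memory figures, the 16000 MB device size, and the `Gpu` and `pack_first_fit_decreasing` names are all hypothetical. It only compares how many devices a set of HPO trials needs when each container holds a whole GPU versus when trials may share one.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """A hypothetical device model that tracks memory only."""
    capacity_mb: int
    jobs: list = field(default_factory=list)  # (name, mem_mb) pairs

    @property
    def used_mb(self) -> int:
        return sum(mem for _, mem in self.jobs)

    def fits(self, mem_mb: int) -> bool:
        return self.used_mb + mem_mb <= self.capacity_mb


def pack_first_fit_decreasing(job_mem_mb: dict, capacity_mb: int) -> list:
    """Place jobs, largest first, on the first GPU with enough free memory."""
    gpus = []
    ordered = sorted(job_mem_mb.items(), key=lambda kv: kv[1], reverse=True)
    for name, mem in ordered:
        if mem > capacity_mb:
            raise ValueError(f"{name} needs {mem} MB, more than one GPU holds")
        target = next((g for g in gpus if g.fits(mem)), None)
        if target is None:
            # No existing device has room, so allocate another one.
            target = Gpu(capacity_mb)
            gpus.append(target)
        target.jobs.append((name, mem))
    return gpus


# Assumed peak-memory needs (MB) for five hyperparameter configurations.
trials = {
    "lr=1e-3,bs=16": 3000,
    "lr=1e-3,bs=64": 9000,
    "lr=1e-4,bs=16": 3000,
    "lr=1e-4,bs=32": 5000,
    "lr=1e-2,bs=128": 12000,
}

exclusive_gpus = len(trials)  # one container occupies one whole device
shared = pack_first_fit_decreasing(trials, capacity_mb=16000)

print(f"exclusive: {exclusive_gpus} GPUs, shared: {len(shared)} GPUs")
for i, gpu in enumerate(shared):
    print(i, gpu.used_mb, [name for name, _ in gpu.jobs])
```

The sketch considers memory alone. Jobs sharing a device also contend for compute units, so a real system would need to account for that as well.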

