FuncPipe: A Pipelined Serverless Framework for Fast and Cost-efficient Training of Deep Learning Models

JCR quartile: Q4 (Computer Science)

Performance Evaluation Review, Pub Date: 2023-06-26, DOI: 10.1145/3606376.3593543

Yunzhuo Liu, Bo Jiang, Tian Guo, Zimeng Huang, Wenhao Ma, Xinbing Wang, Chenghu Zhou

Citations: 0

Abstract

Training deep learning (DL) models in the cloud has become the norm. With the emergence of serverless computing and its benefits of true pay-as-you-go pricing and scalability, systems researchers have recently started to provide support for serverless-based training. However, the ability to train DL models on serverless platforms is hindered by the resource limitations of today's serverless infrastructure and DL models' explosive requirements for memory and bandwidth. This paper describes FUNCPIPE, a novel pipelined training framework specifically designed for serverless platforms that enables fast and low-cost training of DL models. FUNCPIPE is designed with the key insight that model partitioning can be leveraged to bridge both the memory and bandwidth gaps between the capacity of serverless functions and the requirements of DL training. Although the idea is conceptually simple, realizing it requires answering several design questions, including how to partition the model, configure each serverless function, and exploit each function's uplink/downlink bandwidth. We implement FUNCPIPE on two popular cloud serverless platforms and show that it achieves 7%-77% cost savings and 1.3X-2.2X speedup compared to state-of-the-art serverless-based frameworks.
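
To make the model-partitioning idea concrete, here is a minimal sketch of one naive way to split a model's layers into consecutive pipeline stages so that each stage fits within a single function's memory cap. It is an illustration only and is not the partitioning method used by FUNCPIPE, which the abstract does not specify. The function name `partition_layers`, the per-layer memory figures, and the 1000 MB cap are all hypothetical.

```python
from typing import List


def partition_layers(layer_mem_mb: List[int], func_mem_mb: int) -> List[List[int]]:
    """Greedily pack consecutive layers into stages that each fit one function.

    layer_mem_mb: assumed memory footprint of each layer, in MB (hypothetical input).
    func_mem_mb:  assumed memory cap of a single serverless function, in MB.
    Returns a list of stages, each a list of layer indices.
    """
    stages: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, mem in enumerate(layer_mem_mb):
        if mem > func_mem_mb:
            # A single layer that exceeds the cap cannot be placed by this sketch.
            raise ValueError(
                f"layer {idx} needs {mem} MB, above the {func_mem_mb} MB cap"
            )
        if current and used + mem > func_mem_mb:
            # Close the current stage and start a new one with this layer.
            stages.append(current)
            current, used = [], 0
        current.append(idx)
        used += mem
    if current:
        stages.append(current)
    return stages


if __name__ == "__main__":
    # Made-up per-layer footprints; not measurements from the paper.
    layers = [400, 300, 500, 200, 600, 100]
    print(partition_layers(layers, func_mem_mb=1000))
```

This greedy pass considers memory only. The abstract's other design questions, namely how to configure each function and how to exploit its uplink/downlink bandwidth, are outside the scope of the sketch.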

Source journal: Performance Evaluation Review (Computer Science: Computer Networks and Communications)

CiteScore: 1.00

Self-citation rate: 0.00%

Articles published: 193