{"title":"弹性管道:DNN训练的高效动态模型并行解决方案","authors":"Jinkun Geng, Dan Li, Shuai Wang","doi":"10.1145/3322795.3331463","DOIUrl":null,"url":null,"abstract":"Traditional deep neural network (DNN) training is executed with data parallelism, which suffers from significant communication overheads and GPU memory consumption. Considering this, recent pioneering works have attempted to train DNN with model parallelism. However, model partition remains as a major concern and a static partition fails to adapt to the ever-changing computation environment of the cloud cluster. This paper proposes ElasticPipe, which trains the neural network based on pipe-based model parallelism. Unlike data-parallel solutions, each node in ElasticPipe only holds part of the whole model, leading to much lower cost of communication and GPU memory. More importantly, ElasticPipe is able to dynamically tune the workload distribution among different nodes, so that it can mitigate the common straggler effect in cloud environment. Our primary experiment shows, compared to the data-parallel baselines, ElasticPipe can reduce the training time by up to 89.03% without considering straggler effect, and by up to 76.72% with the existence of stragglers. Besides, ElasticPipe also outperforms its static counterpart by up to 28.81% in training performance when stragglers are involved.","PeriodicalId":164694,"journal":{"name":"Proceedings of the 10th Workshop on Scientific Cloud Computing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":"{\"title\":\"ElasticPipe: An Efficient and Dynamic Model-Parallel Solution to DNN Training\",\"authors\":\"Jinkun Geng, Dan Li, Shuai Wang\",\"doi\":\"10.1145/3322795.3331463\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Traditional deep neural network (DNN) training is executed with data parallelism, which suffers from significant communication overheads and GPU memory consumption. Considering this, recent pioneering works have attempted to train DNN with model parallelism. However, model partition remains as a major concern and a static partition fails to adapt to the ever-changing computation environment of the cloud cluster. This paper proposes ElasticPipe, which trains the neural network based on pipe-based model parallelism. Unlike data-parallel solutions, each node in ElasticPipe only holds part of the whole model, leading to much lower cost of communication and GPU memory. More importantly, ElasticPipe is able to dynamically tune the workload distribution among different nodes, so that it can mitigate the common straggler effect in cloud environment. Our primary experiment shows, compared to the data-parallel baselines, ElasticPipe can reduce the training time by up to 89.03% without considering straggler effect, and by up to 76.72% with the existence of stragglers. 
Besides, ElasticPipe also outperforms its static counterpart by up to 28.81% in training performance when stragglers are involved.\",\"PeriodicalId\":164694,\"journal\":{\"name\":\"Proceedings of the 10th Workshop on Scientific Cloud Computing\",\"volume\":\"54 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-06-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"29\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 10th Workshop on Scientific Cloud Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3322795.3331463\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 10th Workshop on Scientific Cloud Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3322795.3331463","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Traditional deep neural network (DNN) training is executed with data parallelism, which suffers from significant communication overhead and GPU memory consumption. In light of this, recent pioneering works have attempted to train DNNs with model parallelism. However, model partitioning remains a major concern, and a static partition fails to adapt to the ever-changing computation environment of a cloud cluster. This paper proposes ElasticPipe, which trains neural networks with pipe-based model parallelism. Unlike data-parallel solutions, each node in ElasticPipe holds only part of the model, leading to much lower communication cost and GPU memory consumption. More importantly, ElasticPipe can dynamically tune the workload distribution among nodes, mitigating the straggler effect that is common in cloud environments. Our preliminary experiments show that, compared to data-parallel baselines, ElasticPipe reduces training time by up to 89.03% when stragglers are absent and by up to 76.72% when stragglers are present. Moreover, when stragglers are involved, ElasticPipe also outperforms its static counterpart by up to 28.81% in training performance.
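The abstract itself contains no code. Below is a minimal, self-contained sketch of the general idea of pipe-based model parallelism that the paper builds on, written against a PyTorch-style API. It is an illustration only, not the authors' ElasticPipe implementation: the two-stage split, layer sizes, and micro-batch count are hypothetical, and in a real deployment each stage would run on a separate node or GPU rather than on one process.

```python
# Minimal sketch of pipe-based model parallelism (illustrative, not ElasticPipe's code).
# Each "stage" holds only a slice of the model's layers; a batch is split into
# micro-batches that stream through the stages, so that in a real pipeline the
# stages can process different micro-batches concurrently.
import torch
import torch.nn as nn

# Hypothetical two-stage split of a small feed-forward model. In ElasticPipe the
# split point would be retuned dynamically to rebalance load when a node straggles.
stage0 = nn.Sequential(nn.Linear(784, 256), nn.ReLU())  # would live on node 0
stage1 = nn.Sequential(nn.Linear(256, 10))              # would live on node 1

def pipelined_forward(batch: torch.Tensor, n_micro: int = 4) -> torch.Tensor:
    """Split a batch into micro-batches and push them through the two stages."""
    outputs = []
    for micro in batch.chunk(n_micro):
        act = stage0(micro)   # node 0's work for this micro-batch
        out = stage1(act)     # node 1's work; overlaps with node 0's next micro-batch in a real pipeline
        outputs.append(out)
    return torch.cat(outputs)

x = torch.randn(32, 784)
logits = pipelined_forward(x)
print(logits.shape)  # torch.Size([32, 10])
```

The key property the sketch shows is that no single worker ever materializes the whole model: each stage stores and updates only its own parameters, which is why communication is limited to activations/gradients at stage boundaries and per-node GPU memory stays low.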