ScaleDNN: Data Movement Aware DNN Training on Multi-GPU

Weizheng Xu, Ashutosh Pattnaik, Geng Yuan, Yanzhi Wang, Youtao Zhang, Xulong Tang
{"title":"ScaleDNN:基于多gpu的数据移动感知DNN训练","authors":"Weizheng Xu, Ashutosh Pattnaik, Geng Yuan, Yanzhi Wang, Youtao Zhang, Xulong Tang","doi":"10.1109/ICCAD51958.2021.9643503","DOIUrl":null,"url":null,"abstract":"Training Deep Neural Networks (DNNs) models is a time-consuming process that requires immense amount of data and computation. To this end, GPUs are widely adopted to accelerate the training process. However, the delivered training performance rarely scales with the increase in the number of GPUs. The major reason behind this is the large amount of data movement that prevents the system from providing the GPUs with the required data in a timely fashion. In this paper, we propose ScaleDNN, a framework that systematically and comprehensively investigates and optimizes data-parallel training on two types of multi-GPU systems (PCIe-based and NVLink-based). Specifically, ScaleDNN performs: i) CPU-centric input batch splitting, ii) mini-batch data pre-loading, and iii) model parameter compression to effectively a) reduce the data movement between the CPU and multiple GPUs, and b) hide the data movement overheads by overlapping the data transfer with the GPU computation. Our experimental results show that ScaleDNN achieves up to 39.38%, with an average of 17.96% execution time saving over modern data parallelism on PCIe-based multi-GPU system. The corresponding execution time reduction on NVLink-based multi-GPU system is up to 19.20% with an average of 10.26%.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ScaleDNN: Data Movement Aware DNN Training on Multi-GPU\",\"authors\":\"Weizheng Xu, Ashutosh Pattnaik, Geng Yuan, Yanzhi Wang, Youtao Zhang, Xulong Tang\",\"doi\":\"10.1109/ICCAD51958.2021.9643503\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Training Deep Neural Networks (DNNs) models is a time-consuming process that requires immense amount of data and computation. To this end, GPUs are widely adopted to accelerate the training process. However, the delivered training performance rarely scales with the increase in the number of GPUs. The major reason behind this is the large amount of data movement that prevents the system from providing the GPUs with the required data in a timely fashion. In this paper, we propose ScaleDNN, a framework that systematically and comprehensively investigates and optimizes data-parallel training on two types of multi-GPU systems (PCIe-based and NVLink-based). Specifically, ScaleDNN performs: i) CPU-centric input batch splitting, ii) mini-batch data pre-loading, and iii) model parameter compression to effectively a) reduce the data movement between the CPU and multiple GPUs, and b) hide the data movement overheads by overlapping the data transfer with the GPU computation. Our experimental results show that ScaleDNN achieves up to 39.38%, with an average of 17.96% execution time saving over modern data parallelism on PCIe-based multi-GPU system. 
The corresponding execution time reduction on NVLink-based multi-GPU system is up to 19.20% with an average of 10.26%.\",\"PeriodicalId\":370791,\"journal\":{\"name\":\"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)\",\"volume\":\"30 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCAD51958.2021.9643503\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCAD51958.2021.9643503","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Training Deep Neural Network (DNN) models is a time-consuming process that requires an immense amount of data and computation. To this end, GPUs are widely adopted to accelerate the training process. However, the delivered training performance rarely scales with the increase in the number of GPUs. The major reason behind this is the large amount of data movement, which prevents the system from providing the GPUs with the required data in a timely fashion. In this paper, we propose ScaleDNN, a framework that systematically and comprehensively investigates and optimizes data-parallel training on two types of multi-GPU systems (PCIe-based and NVLink-based). Specifically, ScaleDNN performs: i) CPU-centric input batch splitting, ii) mini-batch data pre-loading, and iii) model parameter compression to effectively a) reduce the data movement between the CPU and multiple GPUs, and b) hide the data movement overheads by overlapping the data transfer with the GPU computation. Our experimental results show that ScaleDNN saves up to 39.38% of execution time, with an average saving of 17.96%, over modern data parallelism on a PCIe-based multi-GPU system. The corresponding execution time reduction on an NVLink-based multi-GPU system is up to 19.20%, with an average of 10.26%.
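
The paper's implementation is not reproduced here, but the mini-batch pre-loading and transfer/compute overlap described in the abstract can be illustrated with a minimal PyTorch-style sketch. This is a hypothetical example (the class name BatchPrefetcher and its interface are assumptions, not ScaleDNN's API): while the GPU computes on the current mini-batch, the next one is copied host-to-device on a separate CUDA stream from pinned memory, so the transfer overlaps with computation.

# Illustrative sketch only, assuming PyTorch; not ScaleDNN's actual code.
import torch

class BatchPrefetcher:
    """Stages the next mini-batch on the GPU while the current one is being processed."""

    def __init__(self, loader, device):
        self.loader = iter(loader)          # loader should be built with pin_memory=True
        self.device = device
        self.copy_stream = torch.cuda.Stream(device=device)  # side stream for H2D copies
        self.next_batch = None
        self._preload()

    def _preload(self):
        try:
            inputs, targets = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.copy_stream):
            # non_blocking copies from pinned host memory can overlap with GPU compute
            self.next_batch = (inputs.to(self.device, non_blocking=True),
                               targets.to(self.device, non_blocking=True))

    def next(self):
        if self.next_batch is None:
            return None
        # make the compute (default) stream wait for the staged copies to finish
        torch.cuda.current_stream(self.device).wait_stream(self.copy_stream)
        inputs, targets = self.next_batch
        # inform the caching allocator that these tensors are now used on the default stream
        inputs.record_stream(torch.cuda.current_stream(self.device))
        targets.record_stream(torch.cuda.current_stream(self.device))
        self._preload()                     # immediately start staging the following batch
        return inputs, targets

In a training loop, such a prefetcher would wrap a DataLoader created with pin_memory=True; each call to next() then returns a device-resident mini-batch whose host-to-device copy was already overlapped with the previous iteration's forward and backward passes.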