A layer-block-wise pipeline for memory and bandwidth reduction in distributed deep learning

Haruki Mori, Tetsuya Youkawa, S. Izumi, M. Yoshimoto, H. Kawaguchi, Atsuki Inoue
{"title":"A layer-block-wise pipeline for memory and bandwidth reduction in distributed deep learning","authors":"Haruki Mori, Tetsuya Youkawa, S. Izumi, M. Yoshimoto, H. Kawaguchi, Atsuki Inoue","doi":"10.1109/MLSP.2017.8168127","DOIUrl":null,"url":null,"abstract":"This paper describes a pipelined stochastic gradient descent (SGD) algorithm and its hardware architecture with a memory distributed structure. In the proposed architecture, a pipeline stage takes charge of multiple layers: a “layer block.” The layer-block-wise pipeline has much less weight parameters for network training than conventional multithreading because weight memory is distributed to workers assigned to pipeline stages. The memory capacity of 2.25 GB for the four-stage proposed pipeline is about half of the 3.82 GB for multithreading when a batch size is 32 in VGG-F. Unlike multithreaded data parallelism, no parameter server for weight update or shared I/O data bus is necessary. Therefore, the memory bandwidth is drastically reduced. The proposed four-stage pipeline only needs memory bandwidths of 36.3 MB and 17.0 MB per batch, respectively, for forward propagation and backpropagation processes, whereas four-thread multithreading requires a bandwidth of 974 MB overall for send and receive processes to unify its weight parameters. At the parallelization degree of four, the proposed pipeline maintains training convergence by a factor of 1.12, compared with the conventional multithreaded architecture although the memory capacity and the memory bandwidth are decreased.","PeriodicalId":6542,"journal":{"name":"2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP)","volume":"1 1","pages":"1-6"},"PeriodicalIF":0.0000,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MLSP.2017.8168127","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

This paper describes a pipelined stochastic gradient descent (SGD) algorithm and its hardware architecture with a memory-distributed structure. In the proposed architecture, a pipeline stage takes charge of multiple layers: a "layer block." The layer-block-wise pipeline requires far fewer weight parameters for network training than conventional multithreading because weight memory is distributed to the workers assigned to pipeline stages. The memory capacity of 2.25 GB for the proposed four-stage pipeline is about half of the 3.82 GB for multithreading when the batch size is 32 in VGG-F. Unlike multithreaded data parallelism, no parameter server for weight updates or shared I/O data bus is necessary. Therefore, the memory bandwidth is drastically reduced. The proposed four-stage pipeline only needs memory bandwidths of 36.3 MB and 17.0 MB per batch for the forward-propagation and backpropagation processes, respectively, whereas four-thread multithreading requires a bandwidth of 974 MB overall for the send and receive processes that unify its weight parameters. At a parallelization degree of four, the proposed pipeline maintains training convergence to within a factor of 1.12 of the conventional multithreaded architecture, even though both the memory capacity and the memory bandwidth are reduced.
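To make the layer-block idea concrete, the following sketch is a simplified illustration (not the authors' hardware architecture or implementation) of partitioning a network into layer blocks, each owned by one pipeline stage that stores only its own weights. The layer widths, block boundaries, and the toy forward pass are assumptions made purely for illustration.

```python
# Minimal sketch of a layer-block-wise pipeline with memory-distributed weights.
# Each stage owns the weights of a contiguous block of layers; no parameter
# server holds a full copy of the model. All sizes here are illustrative.
import numpy as np


class LayerBlockStage:
    """One pipeline stage: owns the weights of one block of layers."""

    def __init__(self, layer_dims, seed=0):
        # Weights live only in this stage's local memory (memory-distributed).
        rng = np.random.default_rng(seed)
        self.weights = [rng.standard_normal((m, n)) * 0.01
                        for m, n in zip(layer_dims[:-1], layer_dims[1:])]

    def forward(self, x):
        # Forward propagation through every layer in this block.
        for w in self.weights:
            x = np.maximum(x @ w, 0.0)  # ReLU layers as a stand-in
        return x

    def param_bytes(self):
        return sum(w.nbytes for w in self.weights)


# Assumed layer widths for a toy network, split into four layer blocks
# (one per stage, mirroring the four-stage example discussed in the paper).
stages = [
    LayerBlockStage([1024, 512, 512]),
    LayerBlockStage([512, 256, 256]),
    LayerBlockStage([256, 128, 128]),
    LayerBlockStage([128, 64, 10]),
]

# Mini-batches stream through the stages; at steady state each stage works on
# a different batch, so weight storage and traffic stay local to each stage.
activations = np.random.default_rng(1).standard_normal((32, 1024))
for stage_id, stage in enumerate(stages):
    activations = stage.forward(activations)
    print(f"stage {stage_id}: {stage.param_bytes() / 1e6:.2f} MB of weights, "
          f"output shape {activations.shape}")
```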