MLCNN: Cross-Layer Cooperative Optimization and Accelerator Architecture for Speeding Up Deep Learning Applications

Beilei Jiang, Xianwei Cheng, Sihai Tang, Xu Ma, Zhaochen Gu, Song Fu, Qing Yang, Ming-Qing Liu
{"title":"MLCNN: Cross-Layer Cooperative Optimization and Accelerator Architecture for Speeding Up Deep Learning Applications","authors":"Beilei Jiang, Xianwei Cheng, Sihai Tang, Xu Ma, Zhaochen Gu, Song Fu, Qing Yang, Ming-Qing Liu","doi":"10.1109/ipdps53621.2022.00118","DOIUrl":null,"url":null,"abstract":"The ever-increasing number of layers, millions of parameters, and large data volume make deep learning workloads resource-intensive and power-hungry. In this paper, we develop a convolutional neural network (CNN) acceleration framework, named MLCNN, which explores algorithm-hardware co-design to achieve cross-layer cooperative optimization and acceleration. MLCNN dramatically reduces computation and on-off chip communication, improving CNN's performance. To achieve this, MLCNN reorders the position of nonlinear activation layers and pooling layers, which we prove results in a negligible accuracy loss; then the convolutional layer and pooling layer are co-optimized by means of redundant multiplication elimination, local addition reuse, and global addition reuse. To the best of our knowledge, MLCNN is the first of its kind that incorporates cooperative optimization across convolutional, activation, and pooling layers. We further customize the MLCNN accelerator to take full advantage of cross-layer CNN optimization to reduce both computation and on-off chip communication. Our analysis shows that MLCNN can significantly reduce (up to 98%) multiplications and additions. We have implemented a prototype of MLCNN and evaluated its performance on several widely used CNN models using both an accelerator-level cycle and energy model and RTL implementation. Experimental results show that MLCNN achieves 3.2x speedup and 2.9x energy efficiency compared with dense CNNs. MLCNN's optimization methods are orthogonal to other CNN acceleration techniques, such as quantization and pruning. Combined with quantization, our quantized MLCNN gains a 12.8x speedup and 11.3x energy efficiency compared with DCNN.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ipdps53621.2022.00118","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

The ever-increasing number of layers, millions of parameters, and large data volumes make deep learning workloads resource-intensive and power-hungry. In this paper, we develop a convolutional neural network (CNN) acceleration framework, named MLCNN, which explores algorithm-hardware co-design to achieve cross-layer cooperative optimization and acceleration. MLCNN dramatically reduces computation and on-/off-chip communication, improving CNN performance. To achieve this, MLCNN reorders the nonlinear activation and pooling layers, which we prove incurs negligible accuracy loss; the convolutional and pooling layers are then co-optimized through redundant multiplication elimination, local addition reuse, and global addition reuse. To the best of our knowledge, MLCNN is the first framework to incorporate cooperative optimization across the convolutional, activation, and pooling layers. We further customize the MLCNN accelerator to take full advantage of cross-layer CNN optimization, reducing both computation and on-/off-chip communication. Our analysis shows that MLCNN can eliminate up to 98% of multiplications and additions. We have implemented a prototype of MLCNN and evaluated its performance on several widely used CNN models using both an accelerator-level cycle/energy model and an RTL implementation. Experimental results show that MLCNN achieves a 3.2x speedup and 2.9x higher energy efficiency compared with dense CNNs. MLCNN's optimizations are orthogonal to other CNN acceleration techniques such as quantization and pruning; combined with quantization, MLCNN gains a 12.8x speedup and 11.3x higher energy efficiency over the dense CNN (DCNN) baseline.
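The abstract's two core ideas are easy to illustrate. The sketch below is illustrative only, not the authors' implementation, and assumes nothing beyond the abstract; all function names are hypothetical. Demo 1 shows why swapping a ReLU activation with a max-pooling layer is safe: ReLU is monotonically non-decreasing, so it commutes with max exactly, and pooling first leaves 4x fewer elements for the activation (and any fused downstream work) to touch. Demo 2 shows one way redundant multiplications can be eliminated once pooling sits next to the convolution, using the average-pooling case where both operations are linear: a 2x2 sliding sum over the input (additions only) lets each weight be multiplied once per pooled output instead of once per convolution output.

```python
# Illustrative sketch only -- not the MLCNN implementation.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    h, w = x.shape  # assumes even H, W
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def avg_pool_2x2(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def conv2d_valid(x, w, stride=1):
    """Naive 'valid' cross-correlation: one multiply per weight per output."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(x[r:r + kh, c:c + kw] * w)
    return out

# Demo 1: ReLU commutes with max pooling, so reordering the two layers
# is exact -- the activation then runs on a 4x smaller feature map.
fmap = rng.normal(size=(8, 8))
assert np.allclose(max_pool_2x2(relu(fmap)),   # conventional order
                   relu(max_pool_2x2(fmap)))   # pooling-first order

# Demo 2: fusing convolution with 2x2 average pooling. The standard path
# multiplies every weight at every conv output (8*8*9 = 576 multiplies);
# the fused path pre-sums the input with additions only, then multiplies
# each weight once per *pooled* output (4*4*9 = 144 multiplies).
x = rng.normal(size=(10, 10))
w = rng.normal(size=(3, 3))

z_standard = avg_pool_2x2(conv2d_valid(x, w))

s = x[:-1, :-1] + x[1:, :-1] + x[:-1, 1:] + x[1:, 1:]  # 2x2 sliding sum
z_fused = 0.25 * conv2d_valid(s, w, stride=2)

assert np.allclose(z_standard, z_fused)
```

The 4x cut in Demo 2 reflects only the pooling-window factor for a single toy layer; the up-to-98% reduction the abstract reports comes from the paper's full combination of redundant multiplication elimination with local and global addition reuse.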