7.1 An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile SoC
Jinook Song, Yun-Jin Cho, Jun-Seok Park, Jun-Woo Jang, Sehwan Lee, Joonho Song, Jae-Gon Lee, Inyup Kang
2019 IEEE International Solid-State Circuits Conference (ISSCC), February 2019. DOI: 10.1109/ISSCC.2019.8662476
Citations: 81
Abstract
Deep learning has been widely applied to image and speech recognition. Response time, connectivity, privacy, and security drive applications towards mobile platforms rather than the cloud. For mobile systems-on-a-chip (SoCs), energy-efficient neural processing units (NPUs) have been studied for executing the convolutional layers (CLs) and fully-connected layers (FCLs) of deep neural networks [2–5]. Moreover, as neural networks grow deeper, the NPU needs to integrate 1K or even more multiply/accumulate (MAC) units. For energy efficiency, compression of neural networks has been studied: neural connections are pruned, and weights and features are quantized to 8b or even lower fixed-point precision without accuracy loss [1]. A hardware accelerator exploited network sparsity to keep MAC-unit utilization high [3]. However, since it is hard to predict which connections will be pruned, that accelerator required complex circuitry for selecting the array of features corresponding to an array of non-zero weights. To reduce the power of MAC operations, bit-serial multipliers have been applied [5]. Generally, neural networks with extremely low or variable bit precision need to be trained carefully.
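The two techniques the abstract leans on, zero-skipping over pruned weights and bit-serial multiplication, are easy to model in software. Below is a minimal NumPy sketch of the zero-skipping idea: the pruned weight vector is compressed offline into (index, value) pairs, and at runtime a feature-select stage gathers only the activations that line up with the surviving weights, so every MAC cycle does useful work. The function names (compress_weights, sparse_mac) and the ~50% pruning ratio are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def compress_weights(w):
    """Offline: keep only (index, value) pairs for non-zero (unpruned) weights."""
    idx = np.flatnonzero(w)
    return idx, w[idx]

def sparse_mac(features, idx, vals, acc=0):
    """Runtime: gather features at the non-zero weight indices and accumulate,
    so only MAC lanes with useful work fire (zero-skipping)."""
    return acc + np.dot(features[idx].astype(np.int32), vals.astype(np.int32))

# Example: 8b fixed-point weights pruned to roughly 50% sparsity.
rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=16, dtype=np.int8)
w[rng.random(16) < 0.5] = 0                  # pruning
x = rng.integers(0, 256, size=16)            # 8b unsigned features
idx, vals = compress_weights(w)

# The compressed sparse MAC matches the dense dot product.
assert sparse_mac(x, idx, vals) == int(np.dot(x, w.astype(np.int32)))
```

A similarly hedged sketch of a bit-serial multiplier, which consumes one weight bit per cycle with shift-and-add: assuming unsigned weights, a weight quantized to fewer bits finishes in proportionally fewer cycles, which is the usual motivation for bit-serial MACs like those in [5].

```python
def bit_serial_mult(x, w, bits=8):
    """Multiply x by an unsigned 'bits'-wide weight, one weight bit per cycle.
    A weight quantized to 4b could pass bits=4 and finish in half the cycles."""
    acc = 0
    for b in range(bits):        # one "cycle" per weight bit
        if (w >> b) & 1:
            acc += x << b        # shift-and-add when the bit is set
    return acc

assert bit_serial_mult(57, 0b00000101) == 57 * 5
```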