A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3ns and 55.8TOPS/W fully parallel product-sum operation for binary DNN edge processors

W. Khwa, Jia-Jing Chen, Jia-Fang Li, Xin Si, En-Yu Yang, Xiaoyu Sun, Rui Liu, Pai-Yu Chen, Qiang Li, Shimeng Yu, Meng-Fan Chang
{"title":"A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3ns and 55.8TOPS/W fully parallel product-sum operation for binary DNN edge processors","authors":"W. Khwa, Jia-Jing Chen, Jia-Fang Li, Xin Si, En-Yu Yang, Xiaoyu Sun, Rui Liu, Pai-Yu Chen, Qiang Li, Shimeng Yu, Meng-Fan Chang","doi":"10.1109/ISSCC.2018.8310401","DOIUrl":null,"url":null,"abstract":"For deep-neural-network (DNN) processors [1-4], the product-sum (PS) operation predominates the computational workload for both convolution (CNVL) and fully-connect (FCNL) neural-network (NN) layers. This hinders the adoption of DNN processors to on the edge artificial-intelligence (AI) devices, which require low-power, low-cost and fast inference. Binary DNNs [5-6] are used to reduce computation and hardware costs for AI edge devices; however, a memory bottleneck still remains. In Fig. 31.5.1 conventional PE arrays exploit parallelized computation, but suffer from inefficient single-row SRAM access to weights and intermediate data. Computing-in-memory (CIM) improves efficiency by enabling parallel computing, reducing memory accesses, and suppressing intermediate data. Nonetheless, three critical challenges remain (Fig. 31.5.2), particularly for FCNL. We overcome these problems by co-optimizing the circuits and the system. Recently, researches have been focusing on XNOR based binary-DNN structures [6]. Although they achieve a slightly higher accuracy, than other binary structures, they require a significant hardware cost (i.e. 8T-12T SRAM) to implement a CIM system. To further reduce the hardware cost, by using 6T SRAM to implement a CIM system, we employ binary DNN with 0/1-neuron and ±1-weight that was proposed in [7]. We implemented a 65nm 4Kb algorithm-dependent CIM-SRAM unit-macro and in-house binary DNN structure (focusing on FCNL with a simplified PE array), for cost-aware DNN AI edge processors. This resulted in the first binary-based CIM-SRAM macro with the fastest (2.3ns) PS operation, and the highest energy-efficiency (55.8TOPS/W) among reported CIM macros [3-4].","PeriodicalId":6617,"journal":{"name":"2018 IEEE International Solid - State Circuits Conference - (ISSCC)","volume":"11 1","pages":"496-498"},"PeriodicalIF":0.0000,"publicationDate":"2018-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"156","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE International Solid - State Circuits Conference - (ISSCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISSCC.2018.8310401","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

For deep-neural-network (DNN) processors [1-4], the product-sum (PS) operation dominates the computational workload for both convolution (CNVL) and fully-connected (FCNL) neural-network (NN) layers. This hinders the adoption of DNN processors in edge artificial-intelligence (AI) devices, which require low-power, low-cost, and fast inference. Binary DNNs [5-6] are used to reduce computation and hardware costs for AI edge devices; however, a memory bottleneck remains. As shown in Fig. 31.5.1, conventional PE arrays exploit parallelized computation but suffer from inefficient single-row SRAM access to weights and intermediate data. Computing-in-memory (CIM) improves efficiency by enabling parallel computing, reducing memory accesses, and suppressing intermediate data. Nonetheless, three critical challenges remain (Fig. 31.5.2), particularly for FCNL. We overcome these problems by co-optimizing the circuits and the system. Recent research has focused on XNOR-based binary-DNN structures [6]. Although these achieve slightly higher accuracy than other binary structures, they incur significant hardware cost (8T-12T SRAM cells) to implement a CIM system. To further reduce hardware cost by using 6T SRAM for the CIM system, we employ the binary DNN with 0/1 neurons and ±1 weights proposed in [7]. We implemented a 65nm 4Kb algorithm-dependent CIM-SRAM unit-macro and an in-house binary-DNN structure (focusing on FCNL with a simplified PE array) for cost-aware DNN AI edge processors. The result is the first binary-based CIM-SRAM macro, with the fastest (2.3ns) PS operation and the highest energy efficiency (55.8TOPS/W) among reported CIM macros [3-4].
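To illustrate the 0/1-neuron, ±1-weight product-sum that the macro parallelizes, the following minimal NumPy sketch models one PS operation for a single output neuron of a fully-connected binary-DNN layer. The function name, array sizes, and random inputs are illustrative assumptions only; this is a software model of the arithmetic, not the circuit-level behavior of the CIM-SRAM unit-macro.

```python
import numpy as np

def binary_product_sum(neurons, weights):
    """Software model (assumed, for illustration) of one product-sum (PS)
    operation in a fully-connected binary-DNN layer with 0/1 neurons
    and +/-1 weights, as in the binarization scheme of [7].

    neurons : 1-D array of 0/1 activations (one per input)
    weights : 1-D array of +1/-1 weights for a single output neuron
    """
    # With 0/1 neurons, each product is either 0 or the +/-1 weight,
    # so the PS result is the signed count of active inputs.
    return int(np.dot(neurons, weights))

# Example with 8 inputs and one output neuron (sizes are arbitrary).
rng = np.random.default_rng(0)
neurons = rng.integers(0, 2, size=8)    # 0/1 activations
weights = rng.choice([-1, 1], size=8)   # +/-1 weights
print(neurons, weights, binary_product_sum(neurons, weights))
```

Because each product is either 0 or ±1, the PS reduces to a signed accumulation over the activated rows, which is what allows a fully parallel, multi-row evaluation inside the SRAM array rather than row-by-row weight readout.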