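15.3 A 65nm 3T Dynamic Analog RAM-Based Computing-in-Memory Macro and CNN Accelerator with Retention Enhancement, Adaptive Analog Sparsity and 44TOPS/W System Energy Efficiency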

2021 IEEE International Solid-State Circuits Conference (ISSCC) Pub Date: 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9366045

Zhengyu Chen, X. Chen, Jie Gu


Citations: 37

Abstract



Computing-In-Memory (CIM) techniques, which incorporate analog computing inside memory macros, have shown significant advantages in computing efficiency for deep learning applications. While earlier CIM macros were limited by lower bit precision, e.g. binary weights in [1], recent works have shown 4-to-8b precision for the weights/inputs and up to 20b for the output values [2], [3]. Sparsity and application features have also been exploited at the system level to further improve the computation efficiency [4], [5]. To enable higher precision, bit-wise operations were commonly utilized [3], [4]. However, existing solutions that use bit-wise operations with SRAM cells have limitations. Fig. 15.3.1 summarizes the challenges and the solutions in this work. First, all existing solutions utilize 6T/8T/10T SRAM as a CIM cell, which fundamentally limits the size of the CIM array. In this work, we replace the commonly used SRAM cell with a 3-transistor (3T) analog memory cell, referred to as dynamic-analog-RAM (DARAM), which represents a 4b weight value as an analog voltage. This leads to a $\sim 10\times$ reduction in transistor count and achieves an effective CIM single-bit area smaller than the foundry-supplied 6T SRAM cell. Secondly, as no bit-wise calculation is needed in this work, only single-phase MAC operations are performed, removing the throughput degradation associated with previous multi-phase approaches and digital accumulation in [3], [4]. Furthermore, analog linearity issues are mitigated by highly linear time-based activation, removal of matching requirements for critical multi-bit capacitors [4], [6], and a special read current compensation technique. Thirdly, to mitigate the power bottleneck of the ADC or sense amplifier (SA), this work applies analog sparsity-based low-power methods, which include a compute-adaptive ADC skipping operation when the analog MAC value is small (or "sparse") and a special weight-shifting technique, leading to an additional $\sim 2\times$ reduction in CIM-macro power. We demonstrate the proposed techniques using a 65nm CIM-based CNN accelerator showing state-of-the-art energy efficiency.
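The contrast between multi-phase bit-wise MAC and single-phase MAC, and the idea of skipping the ADC for small MAC values, can be illustrated with a purely behavioral Python sketch. It is not the authors' circuit: the function names, the unsigned 4b weight encoding, the `skip_threshold` parameter, and the choice to output zero for a skipped conversion are all assumptions made for illustration.

```python
import numpy as np


def bit_serial_mac(inputs, weights, weight_bits=4):
    """Multi-phase MAC: one pass per weight bit, merged by digital shift-and-add.

    Assumes unsigned integer weights that fit in `weight_bits` bits.
    Returns (mac_value, number_of_phases).
    """
    total = 0
    phases = 0
    for b in range(weight_bits):
        bit_plane = (weights >> b) & 1            # binary weights seen in this phase
        partial = int(np.dot(inputs, bit_plane))  # one array access and conversion
        total += partial << b                     # digital accumulation of partial sums
        phases += 1
    return total, phases


def single_phase_mac(inputs, weights):
    """Single-phase MAC: every cell contributes its full multi-bit weight at once."""
    return int(np.dot(inputs, weights)), 1


def adaptive_readout(mac_value, skip_threshold):
    """Skip the conversion when the MAC magnitude is below a threshold.

    Returns (output, adc_used). Emitting 0 for a skipped value is an
    assumption of this sketch, not a detail taken from the abstract.
    """
    if abs(mac_value) < skip_threshold:
        return 0, False
    return mac_value, True
```

For the same inputs and weights, `bit_serial_mac` and `single_phase_mac` return the same MAC value, but the former needs `weight_bits` phases while the latter needs one, which is the throughput argument made in the abstract. `adaptive_readout` models only the decision to skip, not the power saved by doing so.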
