16.3 A 28nm 384kb 6T-SRAM Computation-in-Memory Macro with 8b Precision for AI Edge Chips

2021 IEEE International Solid- State Circuits Conference (ISSCC) Pub Date : 2021-02-13 DOI:10.1109/ISSCC42613.2021.9365984

Jian-Wei Su, Yen-Chi Chou, Ruhui Liu, Ta-Wei Liu, Pei-Jung Lu, P. Wu, Yen-Lin Chung, Li-Yang Hung, Jin-Sheng Ren, Tianlong Pan, Sih-Han Li, Shih-Chieh Chang, S. Sheu, W. Lo, Chih-I Wu, Xin Si, C. Lo, Ren-Shuo Liu, C. Hsieh, K. Tang, Meng-Fan Chang

{"title":"16.3 A 28nm 384kb 6T-SRAM Computation-in-Memory Macro with 8b Precision for AI Edge Chips","authors":"Jian-Wei Su, Yen-Chi Chou, Ruhui Liu, Ta-Wei Liu, Pei-Jung Lu, P. Wu, Yen-Lin Chung, Li-Yang Hung, Jin-Sheng Ren, Tianlong Pan, Sih-Han Li, Shih-Chieh Chang, S. Sheu, W. Lo, Chih-I Wu, Xin Si, C. Lo, Ren-Shuo Liu, C. Hsieh, K. Tang, Meng-Fan Chang","doi":"10.1109/ISSCC42613.2021.9365984","DOIUrl":null,"url":null,"abstract":"Recent SRAM-based computation-in-memory (CIM) macros enable mid-to-high precision multiply-and-accumulate (MAC) operations with improved energy efficiency using ultra-small/small capacity (0.4-8KB) memory devices. However, advanced CIM-based edge-AI chips favor multiple mid/large capacity SRAM-CIM macros: with high input (IN) and weight (W) precision to reduce the frequency of data reloads from external DRAM, and to avoid the need for additional SRAM buffers or ultra-large on-chip weight buffers. However, enlarging memory capacity and throughput increases the delay parasitics on WLs and BLs, and the number of parallel computing elements; resulting in longer compute latency (tAC), lower energy-efficiency (EF), degraded signal margin, and larger fluctuations in power consumption across data-patterns (see Fig. 16.3.1). Recent SRAM-CIM macros tend to not use in-lab SRAM cells, with a logic-based layout, in favor of foundry provided compact-layout 8T [2], 3, [5] or 6T cells with local-computing cells (LCCs) [4], [6] to reduce the cell-array area and facilitate manufacturing. This paper presents a SRAM-CIM structure using (1) a segmented-BL charge-sharing (SBCS) scheme for MAC operations, with low energy consumption and a consistently high signal margin across MAC values (MACV); (2) An new LCC cell, called a source-injection local-multiplication cell (SILMC), to support the SBCS scheme with a consistent signal margin against transistor process variation; and (3) A prioritized-hybrid-ADC (Ph-ADC) to achieve a small area and power overhead for analog readout. A 28nm 384kb SRAM-CIM macro was fabricated using a foundry compact-6T cell with support for MAC operations with 16 accumulations of 8b-inputs and 8b-weights with near-full precision output (20b). This macro achieves a 7.2ns tAC and a 22.75TOPS/W EF for 8b-MAC operations with an FoM (IN-precision × W-precision × output-ratio × output-channel × EF/tAC) 6× higher than prior work.","PeriodicalId":371093,"journal":{"name":"2021 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"76","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Solid- State Circuits Conference (ISSCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISSCC42613.2021.9365984","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 76

Abstract

Recent SRAM-based computation-in-memory (CIM) macros enable mid-to-high precision multiply-and-accumulate (MAC) operations with improved energy efficiency using ultra-small/small capacity (0.4-8KB) memory devices. However, advanced CIM-based edge-AI chips favor multiple mid/large capacity SRAM-CIM macros: with high input (IN) and weight (W) precision to reduce the frequency of data reloads from external DRAM, and to avoid the need for additional SRAM buffers or ultra-large on-chip weight buffers. However, enlarging memory capacity and throughput increases the delay parasitics on WLs and BLs, and the number of parallel computing elements; resulting in longer compute latency (tAC), lower energy-efficiency (EF), degraded signal margin, and larger fluctuations in power consumption across data-patterns (see Fig. 16.3.1). Recent SRAM-CIM macros tend to not use in-lab SRAM cells, with a logic-based layout, in favor of foundry provided compact-layout 8T [2], 3, [5] or 6T cells with local-computing cells (LCCs) [4], [6] to reduce the cell-array area and facilitate manufacturing. This paper presents a SRAM-CIM structure using (1) a segmented-BL charge-sharing (SBCS) scheme for MAC operations, with low energy consumption and a consistently high signal margin across MAC values (MACV); (2) An new LCC cell, called a source-injection local-multiplication cell (SILMC), to support the SBCS scheme with a consistent signal margin against transistor process variation; and (3) A prioritized-hybrid-ADC (Ph-ADC) to achieve a small area and power overhead for analog readout. A 28nm 384kb SRAM-CIM macro was fabricated using a foundry compact-6T cell with support for MAC operations with 16 accumulations of 8b-inputs and 8b-weights with near-full precision output (20b). This macro achieves a 7.2ns tAC and a 22.75TOPS/W EF for 8b-MAC operations with an FoM (IN-precision × W-precision × output-ratio × output-channel × EF/tAC) 6× higher than prior work.

查看原文本刊更多论文

16.3用于AI边缘芯片的28nm 384kb 6T-SRAM内存宏计算，精度为8b

最近基于sram的内存计算(CIM)宏使用超小型/小容量(0.4-8KB)内存设备，可以实现中高精度的乘法累加(MAC)操作，并提高了能效。然而，先进的基于cim的边缘ai芯片支持多个中/大容量SRAM- cim宏:具有高输入(IN)和重量(W)精度，以减少从外部DRAM重新加载数据的频率，并避免需要额外的SRAM缓冲区或超大片上重量缓冲区。然而，内存容量和吞吐量的增加增加了时延寄生在wl和bl上，并增加了并行计算单元的数量;导致更长的计算延迟(tAC)、更低的能效(EF)、退化的信号裕度，以及跨数据模式的更大功耗波动(见图16.3.1)。最近的SRAM- cim宏倾向于不使用实验室SRAM单元，具有基于逻辑的布局，而倾向于代工提供紧凑布局的8T[2]， 3，[5]或带有本地计算单元(lcc)[4]，[6]的6T单元，以减少单元阵列面积并促进制造。本文提出了一种SRAM-CIM结构，使用(1)MAC操作的分段bl电荷共享(SBCS)方案，具有低能耗和跨MAC值(MACV)一致的高信号余量;(2)一种新的LCC单元，称为源注入局部倍增单元(SILMC)，以支持SBCS方案，具有一致的信号裕度，可以抵抗晶体管工艺变化;(3)优先混合adc (Ph-ADC)，以实现模拟读出的小面积和功耗开销。一个28nm 384kb的SRAM-CIM宏是使用代工紧凑的6t单元制造的，支持MAC操作，具有16个8b输入和8b权重的累积，具有接近全精度的输出(20b)。对于8b-MAC操作，该宏实现了7.2ns tAC和22.75TOPS/W EF, FoM (in精度× W精度×输出比×输出通道× EF/tAC)比以前的工作高6倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2021 IEEE International Solid- State Circuits Conference (ISSCC)

自引率

0.00%

发文量