24.1 A 1Mb Multibit ReRAM Computing-In-Memory Macro with 14.6ns Parallel MAC Computing Time for CNN Based AI Edge Processors

Cheng-Xin Xue, Wei-Hao Chen, Je-Syu Liu, Jia-Fang Li, Wei-Yu Lin, Wei-En Lin, Jing-Hong Wang, Wei-Chen Wei, Ting-Wei Chang, Tung-Cheng Chang, Tsung-Yuan Huang, Hui-Yao Kao, Shih-Ying Wei, Yen-Cheng Chiu, Chun-Ying Lee, C. Lo, Y. King, Chorng-Jung Lin, Ren-Shuo Liu, C. Hsieh, K. Tang, Meng-Fan Chang
DOI: 10.1109/ISSCC.2019.8662395 (https://doi.org/10.1109/ISSCC.2019.8662395)
Published in: 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019-02-01
Citations: 181

Abstract
Embedded nonvolatile memory (NVM) and computing-in-memory (CIM) are significantly reducing the latency (tMAC) and energy consumption (EMAC) of multiply-and-accumulate (MAC) operations in artificial intelligence (AI) edge devices [1, 2]. Previous ReRAM CIM macros demonstrated MAC operations for 1b-input, ternary-weighted, 3b-output CNNs [1] or 1b-input, 8b-weighted, 1b-output fully-connected networks with limited accuracy [2]. To support higher-accuracy applications that rely heavily on convolutional neural networks (CNNs), NVM-CIM should support multibit inputs/weights and multibit output (MAC-OUT) for CNN operations. One way to achieve multibit weights is to use a multi-level ReRAM cell to store the weight. However, as shown in Fig. 24.1.1, multibit ReRAM CIM faces several challenges: (1) a tradeoff between area and speed for multibit input/weight/MAC-OUT MAC operations; (2) the sense amplifier's high input offset, large area, and high parasitic load on the read-path due to large BL currents (IBL) from multibit MAC; and (3) limited accuracy due to a small read/sensing margin (ISM) across MAC-OUT or variation in cell resistance (particularly in MLC cells). To overcome these challenges, this work proposes: (1) a serial-input non-weighted product (SINWP) structure to optimize the tradeoff between area, tMAC and EMAC; (2) a down-scaling weighted current translator (DSWCT) and positive-negative current-subtractor (PN-ISUB) for short delay, a small offset and a compact read-path area; and (3) a triple-margin small-offset current-mode sense amplifier (TMCSA) to tolerate a small ISM. A fabricated 55nm 1Mb ReRAM-CIM macro is the first ReRAM CIM macro to support CNN operations using multibit input/weight MAC-OUT. This device achieves the shortest CIM-MAC-access time (tAC) among existing ReRAM-CIMs (tMAC=14.6ns with 2b-input, 3b-weight with 4b-MAC-OUT) and the best peak EMAC of 53.17 TOPS/W (in binary mode).
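To make the multibit MAC arithmetic concrete, the following is a minimal behavioral sketch of a bit-serial multiply-and-accumulate: one input bit-plane is applied per step, and the binary place value is applied afterwards. It illustrates the arithmetic only and is not the SINWP, DSWCT or PN-ISUB circuitry, whose details the abstract does not give. The function name `bit_serial_mac`, the unsigned 2b input range, and the split of signed weights into positive and negative parts (loosely echoing a positive-negative subtraction) are assumptions made for illustration.

```python
def bit_serial_mac(inputs, weights, input_bits=2):
    """Dot product of unsigned inputs and signed weights, one input bit-plane per step.

    Illustrative assumptions (not taken from the paper):
    - inputs are unsigned integers in [0, 2**input_bits - 1]
    - weights are small signed integers, split into positive and negative parts
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    if any(x < 0 or x >= (1 << input_bits) for x in inputs):
        raise ValueError("input out of range for the given input_bits")

    # Split each signed weight into two non-negative magnitudes.
    pos = [max(w, 0) for w in weights]
    neg = [max(-w, 0) for w in weights]

    total = 0
    for bit in range(input_bits):
        # Extract the current bit of every input (the "bit-plane").
        plane = [(x >> bit) & 1 for x in inputs]
        # Accumulate the weights selected by this bit-plane, unweighted by place value.
        pos_sum = sum(p * b for p, b in zip(pos, plane))
        neg_sum = sum(n * b for n, b in zip(neg, plane))
        # Subtract the negative part, then apply the binary place value of this bit.
        total += (pos_sum - neg_sum) << bit
    return total


if __name__ == "__main__":
    xs = [3, 1, 2, 0]    # 2b inputs
    ws = [2, -3, 1, 3]   # signed weights within a 3b sign-magnitude range
    print(bit_serial_mac(xs, ws))
    print(sum(x * w for x, w in zip(xs, ws)))  # direct dot product for comparison
```

The two printed values agree because summing per-bit partial results scaled by their place values is algebraically the same as the direct dot product; a hardware macro would additionally quantize the result to its output precision (4b MAC-OUT in the reported configuration), which this sketch omits.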