DATIC: A Data-Aware Time-Domain Computing-in-Memory-Based CNN Processor With Dynamic Channel Skipping and Mapping

Jianxun Yang;Yuyao Kong;Yixuan Li;Chenfu Guo;Hao Sun;Leibo Liu;Shaojun Wei;Jun Yang;Shouyi Yin
IEEE Open Journal of the Solid-State Circuits Society, vol. 2, pp. 244-258, published 2022-10-25
DOI: 10.1109/OJSSCS.2022.3216562
https://ieeexplore.ieee.org/document/9927338/
Citations: 0

Abstract

Owing to the low-power advantage of analog delay-based computation, time-domain computing-in-memory (TD-CIM) shows strong potential for energy-constrained edge and IoT scenarios that deploy convolutional neural networks (CNNs). However, the latency of delay-based computation is proportional to the number and values of multiply-and-accumulate operations (MACs), which bottlenecks the throughput of previous data-agnostic TD-CIM-based processors that compute complete convolutions with a fixed MAC mapping. First, some output activations in each CNN layer contribute little to the final classification result; these insignificant activations can be replaced by sums of partial MACs with marginal accuracy degradation, so computing complete convolutions performs redundant MACs. Second, activations and weights vary with input images and models, so a fixed MAC mapping leads to unbalanced MAC values on the delay chains, causing long idle time and latency. To address these problems, we design DATIC, a data-aware TD-CIM-based CNN processor, with three techniques to reduce latency: 1) a channel-skipping TD-CIM macro that removes redundant MACs for insignificant output activations (IOAs) by keeping activations stationary in SRAM bitcells and shifting weights to perform only the necessary MACs; 2) a convolution-order programming unit that reduces the overhead of skipping redundant MACs for IOAs at random positions on feature maps; and 3) an activation-weight-adaptive channel-mapping scheduler that balances the latency of delay chains by dynamically altering the convolution mapping. Implemented in TSMC 28-nm technology, DATIC achieves 622.9-GOPS throughput and 32.7-TOPS/W energy efficiency for ResNet-18 with 2-b weights and 8-b activations.
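The two latency sources named in the abstract can be illustrated with a toy model. The sketch below is not DATIC's hardware or its scheduler. It assumes that a delay chain's latency equals the sum of the MAC values mapped to it, that skipping keeps only a prefix of the input channels, and that balancing uses a greedy largest-first heuristic. All function names and the sample MAC values are hypothetical.

```python
from typing import List, Sequence


def chain_latency(macs: Sequence[int]) -> int:
    """Toy assumption: latency of one delay chain is the sum of its MAC values."""
    return sum(macs)


def makespan(chains: Sequence[Sequence[int]]) -> int:
    """All chains run in parallel, so the slowest chain sets the overall latency."""
    return max(chain_latency(c) for c in chains)


def fixed_mapping(channel_macs: Sequence[int], n_chains: int) -> List[List[int]]:
    """Data-agnostic mapping: channel i always goes to chain i % n_chains."""
    chains: List[List[int]] = [[] for _ in range(n_chains)]
    for i, m in enumerate(channel_macs):
        chains[i % n_chains].append(m)
    return chains


def balanced_mapping(channel_macs: Sequence[int], n_chains: int) -> List[List[int]]:
    """Data-aware mapping (greedy heuristic, an assumption of this sketch):
    place the largest MAC values first, each on the currently least-loaded chain."""
    chains: List[List[int]] = [[] for _ in range(n_chains)]
    loads = [0] * n_chains
    for m in sorted(channel_macs, reverse=True):
        k = loads.index(min(loads))  # least-loaded chain, lowest index on ties
        chains[k].append(m)
        loads[k] += m
    return chains


def skip_channels(channel_macs: Sequence[int], keep: int) -> List[int]:
    """Channel skipping for an insignificant output activation:
    accumulate only the first `keep` channels (a partial MAC sum)."""
    return list(channel_macs[:keep])


if __name__ == "__main__":
    macs = [9, 1, 8, 2, 7, 1, 9, 3]  # hypothetical per-channel MAC values
    print("fixed mapping latency:   ", makespan(fixed_mapping(macs, 2)))
    print("balanced mapping latency:", makespan(balanced_mapping(macs, 2)))
    print("partial sum, 4 channels: ", chain_latency(skip_channels(macs, 4)))
```

With these sample values, the fixed mapping places most of the large MACs on one chain while the other sits idle. The balanced mapping splits the same work evenly, and skipping half the channels halves the accumulated MAC value for that output.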