CSCMAC - Cyclic Sparsely Connected Neural Network Manycore Accelerator

Hirenkumar Paneliya, M. Hosseini, Avesta Sasan, H. Homayoun, T. Mohsenin
{"title":"CSCMAC - Cyclic Sparsely Connected Neural Network Manycore Accelerator","authors":"Hirenkumar Paneliya, M. Hosseini, Avesta Sasan, H. Homayoun, T. Mohsenin","doi":"10.1109/ISQED48828.2020.9137013","DOIUrl":null,"url":null,"abstract":"This paper presents an energy-efficient, domain-specific manycore accelerator also referred to as the “CSCMAC” - Cyclic Sparsely Connected Neural Network Manycore Accelerator, which effectively maps and executes deep neural networks (DNNs) compressed with cyclic sparsely connected (CSC) architectures. CSC layers are architectures that structurally compress and sparsify DNNs, which can reduce the memory footprint of fully connected (FC) layers from $O(N^{2})$ to $O(N\\log N)$ with respect to layers nodes, and is shown to be hardware implementable-friendly. We implement CSC layers for inference on a manycore unit, take advantage of their cyclic architecture, and show that their implementation in software even for a parallel-computing processor is affable. To further take advantage of their implementation simplicity, we propose customized instructions for the manycore that fuse frequently used sequences of machine codes and evaluate the optimization gained by the customization. Our experimental results using a LeNet300100 on MNIST and a Multi-Layer Perceptron (MLP) on Physical Activity Monitoring indicate that by replacing FC layers with CSC layers, we can achieve $46\\times$ and $6\\times$ compression respectively within a margin of 2% accuracy loss. A 64-cluster architecture of the CSCMAC is fully placed and routed using $65\\mathrm{nm}$, TSMC CMOS technology. The layout of each cluster occupies an area of $0.73\\ mm^{2}$ and consumes $230.2 \\mathrm{mW}$ power at 980 MHz clock frequency. Our proposed CSCMAC achieves $1.48\\times$ higher throughput and $1.49\\times$ lower energy compared to its equivalent predecessor manycore (PENC). Also, the CSCMAC achieves $85\\times$ higher throughput and consumes $66.4\\times$ lower energy compared to CPU implementation of the NVIDIA Jetson TX2 platform.","PeriodicalId":225828,"journal":{"name":"2020 21st International Symposium on Quality Electronic Design (ISQED)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 21st International Symposium on Quality Electronic Design (ISQED)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISQED48828.2020.9137013","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

This paper presents an energy-efficient, domain-specific manycore accelerator, referred to as CSCMAC (Cyclic Sparsely Connected Neural Network Manycore Accelerator), which effectively maps and executes deep neural networks (DNNs) compressed with cyclic sparsely connected (CSC) architectures. CSC layers structurally compress and sparsify DNNs, reducing the memory footprint of fully connected (FC) layers from $O(N^{2})$ to $O(N\log N)$ with respect to the number of layer nodes $N$, and are shown to be friendly to hardware implementation. We implement CSC layers for inference on a manycore unit, take advantage of their cyclic architecture, and show that their software implementation is straightforward even on a parallel-computing processor. To further exploit their implementation simplicity, we propose customized instructions for the manycore that fuse frequently used sequences of machine code, and we evaluate the optimization gained by this customization. Our experimental results using LeNet-300-100 on MNIST and a Multi-Layer Perceptron (MLP) on Physical Activity Monitoring indicate that replacing FC layers with CSC layers achieves $46\times$ and $6\times$ compression, respectively, within a 2% accuracy-loss margin. A 64-cluster architecture of the CSCMAC is fully placed and routed in 65 nm TSMC CMOS technology. The layout of each cluster occupies an area of $0.73\ \mathrm{mm}^{2}$ and consumes $230.2\ \mathrm{mW}$ at a 980 MHz clock frequency. Our proposed CSCMAC achieves $1.48\times$ higher throughput and $1.49\times$ lower energy than its equivalent predecessor manycore (PENC). The CSCMAC also achieves $85\times$ higher throughput and consumes $66.4\times$ lower energy than a CPU implementation on the NVIDIA Jetson TX2 platform.
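To illustrate the parameter saving behind the $O(N\log N)$ claim, the sketch below shows one plausible realization of a CSC-style layer, assuming it is built as a cascade of $\log_{2} N$ sparse sub-layers, each with fan-in 2 and cyclically doubling connection offsets (a butterfly-like pattern). The function names, fan-in, and offset schedule are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch of a cyclic sparsely connected (CSC) layer, assuming it is
# realized as log2(N) sparse stages with fan-in 2 and cyclic offsets that
# double each stage. Illustrative only; not the paper's exact implementation.
import numpy as np

def csc_layer(x, weights):
    """Apply a CSC-style layer to input x of length N (N a power of two).

    weights is a list of log2(N) arrays of shape (N, 2): stage k connects
    node i to nodes i and (i + 2**k) mod N, so the total parameter count
    is 2 * N * log2(N) instead of N**2 for a dense (FC) layer.
    """
    y = x.astype(np.float64)
    offset = 1
    for w in weights:                      # log2(N) sparse stages
        partner = np.roll(y, -offset)      # value at node (i + offset) mod N
        y = w[:, 0] * y + w[:, 1] * partner
        offset *= 2
    return y

def init_csc_weights(n, rng):
    stages = int(np.log2(n))
    return [rng.standard_normal((n, 2)) * 0.1 for _ in range(stages)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 8
    x = rng.standard_normal(n)
    w = init_csc_weights(n, rng)
    print("dense params:", n * n, "| CSC-style params:", 2 * n * int(np.log2(n)))
    print("output:", csc_layer(x, w))
```

Under these assumptions the layer stores roughly $2N\lceil\log_{2} N\rceil$ weights instead of $N^{2}$, which illustrates how the FC-layer footprint drops from quadratic to near-linear in the number of nodes.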