CSCMAC - Cyclic Sparsely Connected Neural Network Manycore Accelerator
Hirenkumar Paneliya, M. Hosseini, Avesta Sasan, H. Homayoun, T. Mohsenin
2020 21st International Symposium on Quality Electronic Design (ISQED), March 2020
DOI: 10.1109/ISQED48828.2020.9137013
Cited by: 3
Abstract
This paper presents an energy-efficient, domain-specific manycore accelerator, referred to as CSCMAC (Cyclic Sparsely Connected Neural Network Manycore Accelerator), which efficiently maps and executes deep neural networks (DNNs) compressed with cyclic sparsely connected (CSC) architectures. CSC layers structurally compress and sparsify DNNs, reducing the memory footprint of fully connected (FC) layers from $O(N^{2})$ to $O(N\log N)$ with respect to the number of layer nodes, and have been shown to be friendly to hardware implementation. We implement CSC layers for inference on a manycore unit, exploit their cyclic architecture, and show that their software implementation, even on a parallel-computing processor, is straightforward. To further exploit this implementation simplicity, we propose customized instructions for the manycore that fuse frequently used sequences of machine code, and we evaluate the optimization gained by the customization. Our experimental results using LeNet-300-100 on MNIST and a multi-layer perceptron (MLP) on the Physical Activity Monitoring dataset indicate that by replacing FC layers with CSC layers, we achieve $46\times$ and $6\times$ compression, respectively, within a margin of 2% accuracy loss. A 64-cluster architecture of the CSCMAC is fully placed and routed in 65 nm TSMC CMOS technology. The layout of each cluster occupies an area of $0.73\ mm^{2}$ and consumes $230.2\ \mathrm{mW}$ at a 980 MHz clock frequency. The proposed CSCMAC achieves $1.48\times$ higher throughput and $1.49\times$ lower energy compared to its predecessor manycore (PENC). It also achieves $85\times$ higher throughput and consumes $66.4\times$ lower energy compared to a CPU implementation on the NVIDIA Jetson TX2 platform.
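To illustrate the CSC idea behind the $O(N^{2}) \rightarrow O(N\log N)$ reduction, below is a minimal NumPy sketch, not the authors' implementation, of replacing one $N \times N$ fully connected layer with a cascade of cyclically sparse sub-layers. The fixed fan-in, the doubling cyclic offsets, and the choice of $\log_{2} N$ stages are illustrative assumptions inferred from the abstract.

```python
import numpy as np

def cyclic_sparse_layer(x, weights, fan_in, stride):
    """One cyclically sparse sub-layer: output j sums fan_in inputs taken at
    cyclically shifted offsets j, j+stride, j+2*stride, ... (mod N).
    Only N*fan_in weights are stored instead of N*N."""
    n = x.shape[0]
    y = np.zeros(n)
    for j in range(n):
        for k in range(fan_in):
            y[j] += weights[j, k] * x[(j + k * stride) % n]
    return y

def csc_block(x, weight_list, fan_in=2):
    """Cascade of log2(N) cyclic sparse sub-layers standing in for one
    N x N FC layer; total weight count is N * fan_in * log2(N)."""
    stride = 1
    for w in weight_list:              # len(weight_list) == log2(N) stages
        x = cyclic_sparse_layer(x, w, fan_in, stride)
        stride *= 2                    # double the cyclic offset each stage
    return x

if __name__ == "__main__":
    n, fan_in = 16, 2
    stages = int(np.log2(n))
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((n, fan_in)) for _ in range(stages)]
    x = rng.standard_normal(n)
    y = csc_block(x, weights, fan_in)
    # 16*2*4 = 128 stored weights versus 16*16 = 256 for the dense FC layer
    print(y)
```

The regular, strided connectivity is what makes the layer easy to map onto a manycore: each core can hold a contiguous slice of outputs and fetch its inputs with a fixed modular address pattern, which is also the kind of repeated machine-code sequence the paper's fused custom instructions target.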