CSCMAC - Cyclic Sparsely Connected Neural Network Manycore Accelerator
Hirenkumar Paneliya, M. Hosseini, Avesta Sasan, H. Homayoun, T. Mohsenin
2020 21st International Symposium on Quality Electronic Design (ISQED), March 2020
DOI: 10.1109/ISQED48828.2020.9137013
Cited by: 3
Abstract
This paper presents an energy-efficient, domain-specific manycore accelerator, referred to as CSCMAC (Cyclic Sparsely Connected Neural Network Manycore Accelerator), which efficiently maps and executes deep neural networks (DNNs) compressed with cyclic sparsely connected (CSC) architectures. CSC layers structurally compress and sparsify DNNs, reducing the memory footprint of fully connected (FC) layers from $O(N^{2})$ to $O(N\log N)$ with respect to the number of layer nodes, and have been shown to be friendly to hardware implementation. We implement CSC layers for inference on a manycore unit, exploit their cyclic architecture, and show that their software implementation, even on a parallel-computing processor, is straightforward. To further exploit this implementation simplicity, we propose customized instructions for the manycore that fuse frequently used sequences of machine code, and we evaluate the optimization gained by the customization. Our experimental results using LeNet-300-100 on MNIST and a multi-layer perceptron (MLP) on the Physical Activity Monitoring dataset indicate that by replacing FC layers with CSC layers, we achieve $46\times$ and $6\times$ compression, respectively, within a margin of 2% accuracy loss. A 64-cluster architecture of the CSCMAC is fully placed and routed in 65 nm TSMC CMOS technology. The layout of each cluster occupies an area of $0.73\ mm^{2}$ and consumes $230.2\ \mathrm{mW}$ at a 980 MHz clock frequency. The proposed CSCMAC achieves $1.48\times$ higher throughput and $1.49\times$ lower energy compared to its predecessor manycore (PENC). It also achieves $85\times$ higher throughput and consumes $66.4\times$ lower energy compared to a CPU implementation on the NVIDIA Jetson TX2 platform.
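To illustrate the CSC idea behind the $O(N^{2}) \rightarrow O(N\log N)$ reduction, below is a minimal NumPy sketch, not the authors' implementation, of replacing one $N \times N$ fully connected layer with a cascade of cyclically sparse sub-layers. The fixed fan-in, the doubling cyclic offsets, and the choice of $\log_{2} N$ stages are illustrative assumptions inferred from the abstract.

```python
import numpy as np

def cyclic_sparse_layer(x, weights, fan_in, stride):
    """One cyclically sparse sub-layer: output j sums fan_in inputs taken at
    cyclically shifted offsets j, j+stride, j+2*stride, ... (mod N).
    Only N*fan_in weights are stored instead of N*N."""
    n = x.shape[0]
    y = np.zeros(n)
    for j in range(n):
        for k in range(fan_in):
            y[j] += weights[j, k] * x[(j + k * stride) % n]
    return y

def csc_block(x, weight_list, fan_in=2):
    """Cascade of log2(N) cyclic sparse sub-layers standing in for one
    N x N FC layer; total weight count is N * fan_in * log2(N)."""
    stride = 1
    for w in weight_list:              # len(weight_list) == log2(N) stages
        x = cyclic_sparse_layer(x, w, fan_in, stride)
        stride *= 2                    # double the cyclic offset each stage
    return x

if __name__ == "__main__":
    n, fan_in = 16, 2
    stages = int(np.log2(n))
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((n, fan_in)) for _ in range(stages)]
    x = rng.standard_normal(n)
    y = csc_block(x, weights, fan_in)
    # 16*2*4 = 128 stored weights versus 16*16 = 256 for the dense FC layer
    print(y)
```

The regular, strided connectivity is what makes the layer easy to map onto a manycore: each core can hold a contiguous slice of outputs and fetch its inputs with a fixed modular address pattern, which is also the kind of repeated machine-code sequence the paper's fused custom instructions target.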