{"title":"Reinforcement Learning based Efficient Mapping of DNN Models onto Accelerators","authors":"Shine Parekkadan Sunny, Satyajit Das","doi":"10.1109/coolchips54332.2022.9772673","DOIUrl":"https://doi.org/10.1109/coolchips54332.2022.9772673","url":null,"abstract":"The input tensors in each layer of Deep Neural Network (DNN) models are often partitioned/tiled to get accommodated in the limited on-chip memory of accelerators. Studies show that efficient tiling schedules (commonly referred to as mapping) for a given accelerator and DNN model reduce the data movement between the accelerator and different levels of the memory hierarchy improving the performance. However, finding layer-wise optimum mapping for a target architecture with a given energy and latency envelope is an open problem due to the huge search space in the mappings. In this paper, we propose a Reinforcement Learning (RL) based automated mapping approach to find optimum schedules of DNN layers for a given architecture model without violating the specified energy and latency constraints. The learned policies easily adapt to a wide range of DNN models with different hardware configurations, facilitating transfer learning improving the training time. Experiments show that the proposed work improves latency and energy consumption by an average of 21.5% and 15.6% respectively compared to the state-of-the-art genetic algorithm-based GAMMA approach for a wide range of DNN models running on NVIDIA Deep Learning Accelerator (NVDLA). The training time of RL-based transfer learning is 15× faster than that of GAMMA.","PeriodicalId":266152,"journal":{"name":"2022 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127722938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Encoder-based Many-Pattern Matching on FPGAs","authors":"H. Vu, Ngoc-Dai Bui","doi":"10.1109/coolchips54332.2022.9772671","DOIUrl":"https://doi.org/10.1109/coolchips54332.2022.9772671","url":null,"abstract":"Many-pattern matching is one of the most essential algorithms in many application domains, such as data mining, network security, and bioinformatics. Such high-throughput application domains require high-performance matching engines, leading to the deployment of the algorithm on hardware. However, such hardware deployment consumes a large number of hardware resources. This challenge becomes more critical when scaling the number of patterns as well as the data throughput. In this paper, we first proposed an encoder-based hardware architecture for many-pattern matching on FPGAs. The matching architecture includes two parts: encoder-based filter and matching block. We also proposed an algorithm to simplify the structure of the encoder-based filter, thus reducing the hardware utilization. The hardware architecture is scalable with the number of patterns and the input data throughput. We evaluated our matching architecture and our algorithm with 2048 32-byte patterns abstracted from Snort rules for malware. The evaluation on Xilinx Zedboard shows that at 2.16 Gbps throughput, the proposed architecture achieves higher hardware efficiency at 0.05 LUTs per character, a block RAM consumption 10% of total device, and almost no flip-flop consumption, while the maximum clock frequency and the latency are 270 MHz and 11 ns, respectively.","PeriodicalId":266152,"journal":{"name":"2022 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128818076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 1036 TOp/s/W, 12.2 mW, 2.72 μJ/Inference All Digital TNN Accelerator in 22 nm FDX Technology for TinyML Applications","authors":"Moritz Scherer, Alfio Di Mauro, Georg Rutishauser, Tim Fischer, L. Benini","doi":"10.1109/coolchips54332.2022.9772668","DOIUrl":"https://doi.org/10.1109/coolchips54332.2022.9772668","url":null,"abstract":"Tiny Machine Learning (TinyML) applications impose μJ/Inference constraints, with maximum power consumption of a few tens of mW. It is extremely challenging to meet these requirement at a reasonable accuracy level. In this work, we address this challenge with a flexible, fully digital Ternary Neural Network (TNN) accelerator in a RISC-V-based SoC. The design achieves 2.72 μJ/Inference, 12.2 mW, 3200 Inferences/sec at 0.5 V for a non-trivial 9-layer, 96 channels-per-layer network with CIFAR-10 accuracy of 86 %. The peak energy efficiency is 1036 TOp/s/W, outperforming the state-of-the-art in silicon-proven TinyML accelerators by 1.67x.","PeriodicalId":266152,"journal":{"name":"2022 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":"55 Pt B 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122598171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Body Bias Control on a CGRA based on Convex Optimization","authors":"Takuya Kojima, Hayate Okuhara, Masaaki Kondo, H. Amano","doi":"10.1109/coolchips54332.2022.9772708","DOIUrl":"https://doi.org/10.1109/coolchips54332.2022.9772708","url":null,"abstract":"Body biasing is one of the critical techniques to realize more energy-efficient computing with reconfigurable devices, such as Coarse-Grained Reconfigurable Architectures (CGRAs). Its benefit depends on the control granularity, whereas fine-grained control makes it challenging to find the best body bias voltage for each domain due to the complexity of the optimization problem. This work reformulates the optimization problem and introduces continuous relaxation to solve it faster than previous work. Experimental result shows the proposed method can solve the problem within 0.5 sec for all benchmarks in any conditions and demonstrates up to 5.65x speed-up compared to the previous method with negligible loss of accuracy.","PeriodicalId":266152,"journal":{"name":"2022 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":"317 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133857211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session III Panel Discussions: The Future of Mission-critical, Mixed-criticality High-performance Embedded Systems","authors":"","doi":"10.1109/coolchips54332.2022.9772707","DOIUrl":"https://doi.org/10.1109/coolchips54332.2022.9772707","url":null,"abstract":"","PeriodicalId":266152,"journal":{"name":"2022 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":"12 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133895304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Low-power and Real-time 3D Object Recognition Processor with Dense RGB-D Data Acquisition in Mobile Platforms","authors":"Dongseok Im, Gwangtae Park, Junha Ryu, Zhiyong Li, Sanghoon Kang, Donghyeon Han, Jinsu Lee, Wonhoon Park, Hankyul Kwon, H. Yoo","doi":"10.1109/coolchips54332.2022.9772667","DOIUrl":"https://doi.org/10.1109/coolchips54332.2022.9772667","url":null,"abstract":"A low-power and real-time 3D object recognition with RGBD data acquisition system-on-chip (SoC) is proposed. By synthesizing dense RGB-D data through monocular depth estimation, the proposed system reduces the sensor power for 3D data acquisition by ×27.3 lower. Moreover, the proposed processor reduces the energy consumption of a point cloud based neural network (PNN) exploiting bit-slice-level computation and a point feature reuse method with a pipelined architecture. Additionally, the processor supports the point sampling and grouping algorithms of the PNN with a unified point processing core. Finally, the processor achieves 210.0 mW while implementing 34.0 frame-per-second (fps) end-to-end RGB-D acquisition and 3D object recognition.","PeriodicalId":266152,"journal":{"name":"2022 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":"57 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117218549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DXT501:An SDR-Based Baseband MP-SoC for Multi-Protocol Industrial Wireless Communication","authors":"Yang Chen, Lin Liu, Xuelin Feng, Jinglin Shi","doi":"10.1109/coolchips54332.2022.9772697","DOIUrl":"https://doi.org/10.1109/coolchips54332.2022.9772697","url":null,"abstract":"This paper design and implement an SDR-based baseband MP-SoC DXT501. It contains four high-performance 32-bit ASIPs, a real-time 32-bit RISC processor, a high-performance dual-core 32-bit GP processor ARC HS47Dx2, and some hardware accelerators that support LTE, 4G, MulteFire, and 5G(Release15). What's more, a mobile device solution supporting multiple protocols is proposed. The practical test shows that the mobile device running on the MulteFire 1.1 protocol in the unlicensed frequency band has a transmission capacity of more than 300Mbps in uplink and 150Mbps in downlink, which can meet the requirements of modern industrial wireless communication applications such as mobile inspection robots.","PeriodicalId":266152,"journal":{"name":"2022 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129704435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Memcapacitive Spiking Neural Network with Circuit Nonlinearity-aware Training","authors":"Reon Oshio, Sugahara Takuya, Atsushi Sawada, Mutsumi Kimura, Renyuan Zhang, Y. Nakashima","doi":"10.1109/coolchips54332.2022.9772674","DOIUrl":"https://doi.org/10.1109/coolchips54332.2022.9772674","url":null,"abstract":"Neuromorphic computing is an unconventional computing scheme that executes computable algorithms using Spiking Neural Networks (SNNs) mimicking neural dynamics with high speed and low power consumption by the dedicated hardware. The analog implementation of neuromorphic computing has been studied in the field of edge computing etc. and is considered to be superior to the digital implementation in terms of power consumption. Furthermore, It is expected to have extremely low power consumption that Processing-In-Memory (PIM) based synaptic operations using non-volatile memory (NVM) devices for both weight memory and multiply-accumulate operations. However, unintended non-linearities and hysteresis occur when attempting to implement analog spiking neuron circuits as simply as possible. As a result, it is thought to cause accuracy loss when inference is performed by mapping the weight parameters of the SNNs which trained offline to the element parameters of the NVM. In this study, we newly designed neuromorphic hardware operating at 100 MHz that employs memcapacitor as a synaptic element, which is expected to have ultra-low power consumption. We also propose a method for training SNNs that incorporate the nonlinearity of the designed circuit into the neuron model and convert the synaptic weights into circuit element parameters. The proposed training method can reduce the degradation of accuracy even for very simple neuron circuits. The proposed circuit and method classify MNIST with ∼33.88 nJ/Inference, excluding the encoder, with ∼97% accuracy. The circuit design and measurement of circuit characteristics were performed in Rohm 180nm process using HSPICE. A spiking neuron model that incorporates circuit non-linearity as an activation function was implemented in PyTorch, a machine learning framework for Python.","PeriodicalId":266152,"journal":{"name":"2022 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131489833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power Analysis of Directly-connected FPGA Clusters","authors":"Kensuke Iizuka, Haruna Takagi, Aika Kamei, Kazuei Hironaka, H. Amano","doi":"10.1109/coolchips54332.2022.9772675","DOIUrl":"https://doi.org/10.1109/coolchips54332.2022.9772675","url":null,"abstract":"Although low power consumption is a significant advantage of FPGA clusters, almost no power analyses with real systems have been reported. This study reports the detailed power consumption analyses of two FPGA clusters, namely, M-KUBOS and FiC, with power measurement tools and real applications. In both clusters, the type of logic design shells determines the base power consumption. For building clusters, the power for node communication links is mainly determined by the number of activated links and not influenced by the number of actually used links. Therefore, applying the link aggregation technique does not affect the power consumption. Increasing the clock frequency of the application logic mildly increases the power consumption. The obtained results suggest that the best way to reduce the total power consumption of an FPGA cluster and improve its performance is to use the minimum number of links for the application, apply link aggregation, and aggressively increase the clock frequency.","PeriodicalId":266152,"journal":{"name":"2022 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123713524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware Acceleration of Aggregate Signature Generation and Authentication by BLS Signature over BLS12-381 curve","authors":"Kaoru Masada, R. Nakayama, M. Ikeda","doi":"10.1109/coolchips54332.2022.9772706","DOIUrl":"https://doi.org/10.1109/coolchips54332.2022.9772706","url":null,"abstract":"BLS signature is a digital signature scheme computed over elliptic curves, and it has been attracting attention with its interesting function that signatures can be aggregated. We will introduce our progress of designing two ASIC architectures to accelerate the complex computations of generating and verifying signatures respectively. The computations include mapping to elliptic curves and pairing. An important subject of our work is to adopt a relatively new curve called BLS12-381. BLS12-381 is currently one of the curves that gather the most interests and yet very few ASIC implementations are optimized for BLS12-381.","PeriodicalId":266152,"journal":{"name":"2022 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125829342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}