{"title":"Optimizing CNN Accelerator With Improved Roofline Model","authors":"Shaoxia Fang, Shulin Zeng, Yu Wang","doi":"10.1109/socc49529.2020.9524754","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524754","url":null,"abstract":"The external memory I/O bandwidth is the most common performance bottleneck for Convolutional Neural Network(CNN) inference accelerators. On the other hand, performance is also affected by many other factors such as the on-chip memory size and data scheduling strategies, making it difficult to identify the root cause of performance degradation. This paper proposes an improved roofline model specifically for the CNN accelerator, which provides a deep understanding of the bandwidth bottlenecks and points out the direction of optimization. Previous roofline models have focused on modeling and optimizing each layer, while neglecting some high-level optimizations (e.g. layer fusion and batch processing) that alleviate the bandwidth requirements. However, the uneven cross-layer bandwidth requirements can have a significant impact on the overall performance, and the combination of independently optimized layers does not necessarily result in an overall optimal solution. Our model is capable of modeling more complex data scheduling strategies and enables a larger design space than previous roofline models. We use the Xilinx CNN accelerator on ZU9 FPGA as an example for quantitative analysis and optimization. We apply the optimization method derived from the improved roofline model to the original design and ultimately achieve a 1.6x performance improvement. 
The derived optimization method effectively solves the severe temporary bandwidth overload problem in the original design that leads to the computational inefficiency.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127386595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Welcome Message from the TPC Chairs","authors":"","doi":"10.1109/socc49529.2020.9524777","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524777","url":null,"abstract":"","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122009340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-Speed Architecture for the Reduction in VDF Based on a Class Group","authors":"Yifeng Song, Danyang Zhu, Jing Tian, Zhongfeng Wang","doi":"10.1109/socc49529.2020.9524783","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524783","url":null,"abstract":"Due to the enormous energy consuming involved in the proof of work (POW) process, the resource-efficient blockchain system is urged to be released. The verifiable delay function (VDF), being slow to compute and easy to verify, is believed to be the kernel function of the next-generation blockchain system. In general, the reduction over a class group, involving many complex operations, such as the large-number division and multiplication operations, takes a large portion in the VDF. In this paper, for the first time, we propose a highspeed architecture for the reduction by incorporating algorithmic transformations and architectural optimizations. Firstly, based on the fastest reduction algorithm, we present a modified version to make it more hardware-friendly by introducing a novel transformation method that can efficiently remove the large-number divisions. Secondly, highly parallelized and pipelined architectures are devised respectively for the large-number multiplication and addition operations to reduce the latency and the critical path. Thirdly, a compact state machine is developed to enable maximum overlapping in time for computations. The experiment results show that when computing 209715 reduction steps with the input width of 2048 bits, the proposed design only takes 137.652ms running on an Altera Stratix-10 FPGA at 100MHz frequency, while the original algorithm needs 3278ms when operating over an i7-6850K CPU at 3.6GHz frequency. 
Thus we have obtained a drastic speedup of nearly 24x over an advanced CPU.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134424076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Ferroelectric FET Based In-memory Architecture for Multi-Precision Neural Networks","authors":"T. Soliman, R. Olivo, T. Kirchner, M. Lederer, T. Kämpfe, A. Guntoro, N. Wehn","doi":"10.1109/socc49529.2020.9524750","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524750","url":null,"abstract":"Computing-in-memory (CIM) is a promising approach to improve the throughput and the energy efficiency of deep neural network (DNN) processors. So far, resistive nonvolatile memories have been adapted to build crossbar-based accelerators for DNN inference. However, such structures suffer from several drawbacks such as sneak paths, large ADCs/DACs, high write energy, etc. In this paper we present a mixed signal in-memory hardware accelerator for CNNs. We propose an in-memory inference system that uses FeFETs as the main nonvolatile memory cell. We show how the proposed crossbar unit cell can overcome the aforementioned issues while reducing unit cell size and power consumption. The proposed system decomposes multi-bit operands down to single bit operations. We then re-combine them without any loss of precision using accumulators and shifters within the crossbar and across different crossbars. Simulations demonstrate that we can outperform state-of-the-art efficiencies with 3.28 TOPS/W and can pack 1.64 TOPS in an area of 1.52mm2using 22 nm FDSOI technology,","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132942196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Reinforcement Learning for Self-Configurable NoC","authors":"Md Farhadur Reza","doi":"10.1109/socc49529.2020.9524761","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524761","url":null,"abstract":"Network-on-Chips (NoCs) has been the superior interconnect fabric for multi/many-core on-chip systems because of its scalability and parallelism. On-chip network resources can be dynamically configured to improve the energy-efficiency and performance of NoC. However, large and complex design space in heterogeneous NoC architectures becomes difficult to explore within a reasonable time for optimal trade-offs of energy and performance. Furthermore, reactive resource management is not effective in preventing problems, such as creating thermal hotspots and exceeding chip power budget, from happening in adaptive systems. Therefore, we propose machine learning (ML) technique to provide proactive solution within an instant for both energy and performance efficiency. In this paper, we present deep reinforcement learning (deep RL) techniques to configure the voltage/frequency levels of both NoC routers and links in multicore architectures for energy-efficiency while providing high-performance NoC. We propose the use of reinforcement learning (RL) to configure the NoC resources intelligently based on system utilization and application demands. Additionally, neural networks (NNs) are used to approximate the actions of distributed RL agents in large-scale systems, to mitigate the large cost of traditional table-based RL. Simulations results for 256-core and 16-core NoC architectures under real-world benchmarks show that the proposed approach improves energy-delay product significantly (40%) when compared to traditional non-ML based solution. 
Furthermore, the proposed solution incurs very low energy and hardware overhead while providing self-configurable NoC to meet the real-time requirements of applications.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121933892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Reconfigurable Permutation Based Address Encryption Architecture for Memory Security","authors":"Yuchen Mei, Li Du, Xuewen He, Yuan Du, Xiaoliang Chen, Zhongfeng Wang","doi":"10.1109/socc49529.2020.9524762","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524762","url":null,"abstract":"Most of the existing memory encryption techniques in IoT devices are based on data encryption. The level of security increases at the cost of the increased encryption algorithm complexity, resulting in large power consumption and area overhead for high-security devices. In this paper, we take a significantly different approach to encrypt the device memory through address encryption. A reconfigurable architecture called Permutation based Address Encryption (PAE) is proposed, for the first time, to encrypt the device memory with minor hardware overhead and much shorter processing time. The architecture is synthesized in SMIC 40nm standard CMOS technology. Compared with Data Encryption Standard (DES), the proposed PAE achieves 16x encryption speed and 1.4x effective key length. When combined with the DES, the PAE+DES encryption outperforms existing hardware Advanced Encryption Standard (AES) with almost 2x in power efficiency, more than 1.5x in area efficiency and better security, making it a promising hardware encryption technique for IoT devices.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114781768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Configurable FPGA Accelerator of Bi-LSTM Inference with Structured Sparsity","authors":"Shouliang Guo, Chao Fang, Jun Lin, Zhongfeng Wang","doi":"10.1109/socc49529.2020.9524784","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524784","url":null,"abstract":"To deploy Bi-directional Long Short-Term Memory (Bi-LSTM) on resource-constrained embedded devices, this work presents a configurable FPGA-based Bi-LSTM accelerator enabling structured compression. Firstly, a dense Bi-LSTM model is thoroughly slimed by a hybrid quantization scheme and a structured top-k pruning. Secondly, the energy consumption on external memory access is significantly reduced by the proposed row-reuse computing pattern. Finally, the proposed accelerator is capable of handling a structured sparse Bi-LSTM model benefitting from the algorithm-hardware co-design workflow. It is also flexible to perform inference tasks on Bi-LSTM models with any feature dimension, sequence length, and number of layers. Implemented on the Intel Cyclone V SXC5 SoC FPGA platform, the proposed accelerator can achieve 189.69 GOPs on structured sparse Bi-LSTM networks without batching. Compared with the implementations on CPU and GPU, the low-cost FPGA accelerator achieves 43.5x and 6.3x speedup on latency, 520.9x and 46.5 x improvement on energy efficiency, respectively.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129837089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Secure Your SoC: Building System-an-Chip Designs for Security","authors":"S. Bhasin, Trevor E. Carlson, A. Chattopadhyay, Vinay B. Y. Kumar, A. Mendelson, R. Poussier, Yaswanth Tavva","doi":"10.1109/socc49529.2020.9524760","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524760","url":null,"abstract":"Modern System-on-Chip designs (SoCs) are becoming increasingly complex and powerful, catering to a wide range of application domains. Their use in security-critical tasks calls for a holistic approach to SoC design, including security as a first-class architecture constraint, rather than adding security only as an afterthought. The problem is compounded by the inclusion of multiple, potentially untrusted, third party components in the SoC design. To address this challenge systematically, this paper explores four distinct and important aspects of designing secure SoCs. First, starting at the component level, an evaluation framework for assessing component security against physical attacks is proposed. Second, a scalable simulation framework is developed to integrate these secure components which offers flexibility for early- and late-stage SoC development. Third, dynamic and static techniques are proposed to determine when the system is under attack, with a key focus on Hardware Trojans as threat. Finally, a design strategy for integrating untrusted components into a SoC through hardware Root-of-Trust is outlined. 
For each of these aspects we present early-stage evaluations, and show how these complement each other towards the design of a secure SoC.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125910662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}