Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications
Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Haotian Tang, Hanrui Wang, Ligeng Zhu, Song Han
ACM Transactions on Design Automation of Electronic Systems (TODAES), pp. 1–50. Published 2022-03-04. DOI: https://doi.org/10.1145/3486618

Abstract: Deep neural networks (DNNs) have achieved unprecedented success in the field of artificial intelligence (AI), including computer vision, natural language processing, and speech recognition. However, their superior performance comes at the considerable cost of computational complexity, which greatly hinders their application in many resource-constrained devices, such as mobile phones and Internet of Things (IoT) devices. Methods and techniques that can lift the efficiency bottleneck while preserving the high accuracy of DNNs are therefore in great demand for enabling numerous edge AI applications. This article provides an overview of efficient deep learning methods, systems, and applications. We start by introducing popular model compression methods, including pruning, factorization, and quantization, as well as compact model design. To reduce the large design cost of these manual solutions, we discuss the AutoML framework for each of them, such as neural architecture search (NAS) and automated pruning and quantization. We then cover efficient on-device training to enable user customization based on the local data on mobile devices. Apart from general acceleration techniques, we also showcase several task-specific accelerations for point cloud, video, and natural language processing that exploit their spatial sparsity and temporal/token redundancy. Finally, to support all these algorithmic advancements, we introduce efficient deep learning system design from both software and hardware perspectives.

MVP: An Efficient CNN Accelerator with Matrix, Vector, and Processing-Near-Memory Units
Sunjung Lee, Jaewan Choi, Wonkyung Jung, Byeongho Kim, Jaehyun Park, Hweesoo Kim, Jung Ho Ahn
ACM Transactions on Design Automation of Electronic Systems (TODAES), pp. 1–25. Published 2022-02-24. DOI: https://doi.org/10.1145/3497745

Abstract: Mobile and edge devices have become common platforms for convolutional neural network (CNN) inference due to their superior privacy and service quality. To reduce the computational cost of convolution (CONV), recent CNN models adopt depth-wise CONV (DW-CONV) and Squeeze-and-Excitation (SE). However, existing area-efficient CNN accelerators are sub-optimal for these latest models because they were mainly optimized for compute-intensive standard CONV layers, whose abundant data reuse lets them be pipelined with activation and normalization operations. In contrast, DW-CONV and SE are memory-intensive with limited data reuse, and SE also strongly depends on the nearby CONV layers, making effective pipelining a daunting task. As a result, although DW-CONV and SE account for only 10% of all operations, they become memory-bandwidth bound and consume more than 60% of the processing time in systolic-array-based accelerators. We propose a CNN acceleration architecture called MVP, which efficiently processes both compute- and memory-intensive operations with a small area overhead on top of a baseline systolic-array-based architecture. We add a specialized vector unit tailored to DW-CONV, comprising multipliers, adder trees, and multi-banked buffers that meet its high memory bandwidth requirement, and we augment the unified buffer with tiny processing elements to smoothly pipeline SE with the subsequent CONV, enabling concurrent processing of DW-CONV and standard CONV to maximize arithmetic unit utilization. Our evaluation shows that MVP improves performance by 2.6× and reduces energy by 47% on average for EfficientNet-B0/B4/B7, MnasNet, and MobileNet-V1/V2, with only a 9% area overhead over the baseline.

{"title":"Energy-Efficient LSTM Inference Accelerator for Real-Time Causal Prediction","authors":"Zhe Chen, H. T. Blair, Jason Cong","doi":"10.1145/3495006","DOIUrl":"https://doi.org/10.1145/3495006","url":null,"abstract":"Ever-growing edge applications often require short processing latency and high energy efficiency to meet strict timing and power budget. In this work, we propose that the compact long short-term memory (LSTM) model can approximate conventional acausal algorithms with reduced latency and improved efficiency for real-time causal prediction, especially for the neural signal processing in closed-loop feedback applications. We design an LSTM inference accelerator by taking advantage of the fine-grained parallelism and pipelined feedforward and recurrent updates. We also propose a bit-sparse quantization method that can reduce the circuit area and power consumption by replacing the multipliers with the bit-shift operators. We explore different combinations of pruning and quantization methods for energy-efficient LSTM inference on datasets collected from the electroencephalogram (EEG) and calcium image processing applications. Evaluation results show that our proposed LSTM inference accelerator can achieve 1.19 GOPS/mW energy efficiency. The LSTM accelerator with 2-sbit/16-bit sparse quantization and 60% sparsity can reduce the circuit area and power consumption by 54.1% and 56.3%, respectively, compared with a 16-bit baseline implementation.","PeriodicalId":6933,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems (TODAES)","volume":"31 1","pages":"1 - 19"},"PeriodicalIF":0.0,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83732174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implication of Optimizing NPU Dataflows on Neural Architecture Search for Mobile Devices","authors":"Jooyeon Lee, Junsang Park, Seunghyun Lee, J. Kung","doi":"10.1145/3513085","DOIUrl":"https://doi.org/10.1145/3513085","url":null,"abstract":"Recent advances in deep learning have made it possible to implement artificial intelligence in mobile devices. Many studies have put a lot of effort into developing lightweight deep learning models optimized for mobile devices. To overcome the performance limitations of manually designed deep learning models, an automated search algorithm, called neural architecture search (NAS), has been proposed. However, studies on the effect of hardware architecture of the mobile device on the performance of NAS have been less explored. In this article, we show the importance of optimizing a hardware architecture, namely, NPU dataflow, when searching for a more accurate yet fast deep learning model. To do so, we first implement an optimization framework, named FlowOptimizer, for generating a best possible NPU dataflow for a given deep learning operator. Then, we utilize this framework during the latency-aware NAS to find the model with the highest accuracy satisfying the latency constraint. As a result, we show that the searched model with FlowOptimizer outperforms the performance by 87.1% and 92.3% on average compared to the searched model with NVDLA and Eyeriss, respectively, with better accuracy on a proxy dataset. We also show that the searched model can be transferred to a larger model to classify a more complex image dataset, i.e., ImageNet, achieving 0.2%/5.4% higher Top-1/Top-5 accuracy compared to MobileNetV2-1.0 with 3.6 ( times ) lower latency.","PeriodicalId":6933,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems (TODAES)","volume":"1 1","pages":"1 - 24"},"PeriodicalIF":0.0,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79238963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E2HRL: An Energy-efficient Hardware Accelerator for Hierarchical Deep Reinforcement Learning
Aidin Shiri, Utteja Kallakuri, Hasib-Al Rashid, Bharat Prakash, Nicholas R. Waytowich, T. Oates, T. Mohsenin
ACM Transactions on Design Automation of Electronic Systems (TODAES), pp. 1–19. Published 2022-02-24. DOI: https://doi.org/10.1145/3498327

Abstract: Recently, Reinforcement Learning (RL) has shown great performance in solving sequential decision-making and control problems in dynamic environments. Despite these achievements, deploying Deep Neural Network (DNN)-based RL is expensive in terms of time and power due to the large number of episodes required to train agents with high-dimensional image representations. Additionally, at inference time, the large energy footprint of deep neural networks can be a major drawback. Embedded edge devices, the main platform for deploying RL applications, are intrinsically resource-constrained, and deploying DNN-based RL on them is challenging. As a result, reducing the number of actions the RL agent takes to learn the desired policy, along with energy-efficient deployment of RL, is crucial. In this article, we propose Energy Efficient Hierarchical Reinforcement Learning (E2HRL), a scalable hardware architecture for RL applications. E2HRL utilizes a cross-layer design methodology to achieve better energy efficiency, smaller model size, higher accuracy, and system integration across the software and hardware layers. Our proposed model for the RL agent is designed around learning hierarchical policies, which makes the network architecture more efficient for implementation on mobile devices. We evaluated our model in three RL environments with different levels of complexity. Simulation results and our analysis show that hierarchical policy learning with several levels of control improves the RL agent's training efficiency: the agent learns the desired policy faster than with a non-hierarchical model. This improvement is especially pronounced as the environment or task becomes more complex, with multiple objective subgoals. We tested our model with different hyperparameters to maximize the reward achieved by the RL agent while minimizing the model size, parameter count, and required number of operations. E2HRL enables efficient deployment of the RL agent on resource-constrained embedded devices with the proposed custom hardware architecture, which is scalable and fully parameterized with respect to the number of input channels, filter size, and depth. The number of processing engines (PEs) in the proposed hardware can vary from 1 to 8, providing the flexibility to trade off latency, throughput, power, and energy efficiency. By performing a systematic hardware parameter analysis and design space exploration, we implemented the most energy-efficient hardware architectures of E2HRL on a Xilinx Artix-7 FPGA and an NVIDIA Jetson TX2. Comparing the implementation results shows that the Jetson TX2 achieves 0.1∼1.3 GOP/s/W energy efficiency while the Artix-7 FPGA achieves 1.1∼11.4 GOP/s/W, i.e., 8.8×∼11× better energy efficiency for E2HRL when the model is implemented on the FPGA. Additionally, compared to similar works, our design shows better performance and energy efficiency.

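The hierarchical policy structure, a high-level controller that emits a subgoal every few steps and a low-level controller that selects primitive actions conditioned on it, can be sketched generically (an illustration of hierarchical RL in general, not the E2HRL network itself):

```python
import numpy as np

class TwoLevelPolicy:
    """High-level policy picks a subgoal every `horizon` steps;
    low-level policy acts on observation + current subgoal."""
    def __init__(self, high_policy, low_policy, horizon=8):
        self.high, self.low, self.horizon = high_policy, low_policy, horizon
        self.subgoal, self.t = None, 0

    def act(self, obs):
        if self.t % self.horizon == 0:
            self.subgoal = self.high(obs)   # re-plan the subgoal
        self.t += 1
        return self.low(np.concatenate([obs, self.subgoal]))
```
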
{"title":"Energy Efficient Boosting of GEMM Accelerators for DNN via Reuse","authors":"Nihat Mert Cicek, Xipeng Shen, O. Ozturk","doi":"10.1145/3503469","DOIUrl":"https://doi.org/10.1145/3503469","url":null,"abstract":"Reuse-centric convolutional neural networks (CNN) acceleration speeds up CNN inference by reusing computations for similar neuron vectors in CNN’s input layer or activation maps. This new paradigm of optimizations is, however, largely limited by the overheads in neuron vector similarity detection, an important step in reuse-centric CNN. This article presents an in-depth exploration of architectural support for reuse-centric CNN. It addresses some major limitations of the state-of-the-art design and proposes a novel hardware accelerator that improves neuron vector similarity detection and reduces the energy consumption of reuse-centric CNN inference. The accelerator is implemented to support a wide variety of neural network settings with a banked memory subsystem. Design exploration is performed through RTL simulation and synthesis on an FPGA platform. When integrated into Eyeriss, the accelerator can potentially provide improvements up to 7.75 ( times ) in performance. Furthermore, it can reduce the energy used for similarity detection up to 95.46%, and it can accelerate the convolutional layer up to 3.63 ( times ) compared to the software-based implementation running on the CPU.","PeriodicalId":6933,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems (TODAES)","volume":"29 1","pages":"1 - 26"},"PeriodicalIF":0.0,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86232619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic Quantization Range Control for Analog-in-Memory Neural Networks Acceleration
Nathan Laubeuf, J. Doevenspeck, I. Papistas, Michele Caselli, S. Cosemans, Peter Vrancx, Debjyoti Bhattacharjee, A. Mallik, P. Debacker, D. Verkest, F. Catthoor, R. Lauwereins
ACM Transactions on Design Automation of Electronic Systems (TODAES), pp. 1–21. Published 2022-02-24. DOI: https://doi.org/10.1145/3498328

Abstract: Analog in-Memory Computing (AiMC)-based neural network acceleration is a promising solution for increasing the energy efficiency of deep neural network deployment. However, the quantization requirements of these analog systems are not compatible with state-of-the-art neural network quantization techniques. While modern quantization techniques consider the quantization of weights and activations, AiMC accelerators also impose quantization of each Matrix-Vector Multiplication (MVM) result. In most demonstrated AiMC implementations, the quantization range of MVM results is treated as a fixed parameter of the accelerator. This work demonstrates that dynamic control over this quantization range is not only possible but also desirable for analog neural network acceleration. An AiMC-compatible quantization flow, coupled with a hardware-aware quantization range driving technique, is introduced to fully exploit these dynamic ranges. Using CIFAR-10 and ImageNet as benchmarks, the proposed solution yields networks that are both more accurate and more robust to the inherent vulnerability of analog circuits than fixed-quantization-range approaches.

{"title":"EF-Train: Enable Efficient On-device CNN Training on FPGA through Data Reshaping for Online Adaptation or Personalization","authors":"Yue Tang, Xinyi Zhang, Peipei Zhou, Jingtong Hu","doi":"10.1145/3505633","DOIUrl":"https://doi.org/10.1145/3505633","url":null,"abstract":"Conventionally, DNN models are trained once in the cloud and deployed in edge devices such as cars, robots, or unmanned aerial vehicles (UAVs) for real-time inference. However, there are many cases that require the models to adapt to new environments, domains, or users. In order to realize such domain adaption or personalization, the models on devices need to be continuously trained on the device. In this work, we design EF-Train, an efficient DNN training accelerator with a unified channel-level parallelism-based convolution kernel that can achieve end-to-end training on resource-limited low-power edge-level FPGAs. It is challenging to implement on-device training on resource-limited FPGAs due to the low efficiency caused by different memory access patterns among forward and backward propagation and weight update. Therefore, we developed a data reshaping approach with intra-tile continuous memory allocation and weight reuse. An analytical model is established to automatically schedule computation and memory resources to achieve high energy efficiency on edge FPGAs. The experimental results show that our design achieves 46.99 GFLOPS and 6.09 GFLOPS/W in terms of throughput and energy efficiency, respectively.","PeriodicalId":6933,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems (TODAES)","volume":"46 1","pages":"1 - 36"},"PeriodicalIF":0.0,"publicationDate":"2022-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78804329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NoC Application Mapping Optimization Using Reinforcement Learning","authors":"S. Jagadheesh, P. V. Bhanu, S. J","doi":"10.1145/3510381","DOIUrl":"https://doi.org/10.1145/3510381","url":null,"abstract":"Application mapping is one of the early stage design processes aimed to improve the performance of Network-on-Chip. Mapping is an NP-hard problem. A massive amount of high-quality supervised data is required to solve the application mapping problem using traditional neural networks. In this article, a reinforcement learning–based neural framework is proposed to learn the heuristics of the application mapping problem. The proposed reinforcement learning–based mapping algorithm (RL-MAP) has actor and critic networks. The actor is a policy network, which provides mapping sequences. The critic network estimates the communication cost of these mapping sequences. The actor network updates the policy distribution in the direction suggested by the critic. The proposed RL-MAP is trained with unsupervised data to predict the permutations of the cores to minimize the overall communication cost. Further, the solutions are improved using the 2-opt local search algorithm. The performance of RL-MAP is compared with a few well-known heuristic algorithms, the Neural Mapping Algorithm (NMA) and message-passing neural network-pointer network-based genetic algorithm (MPN-GA). Results show that the communication cost and runtime of the RL-MAP improved considerably in comparison with the heuristic algorithms. The communication cost of the solutions generated by RL-MAP is nearly equal to MPN-GA and improved by 4.2% over NMA, while consuming less runtime.","PeriodicalId":6933,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems (TODAES)","volume":"27 1","pages":"1 - 16"},"PeriodicalIF":0.0,"publicationDate":"2022-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90518626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design Automation Algorithms for the NP-Separate VLSI Design Methodology","authors":"Monzurul Islam Dewan, D. Kim","doi":"10.1145/3508375","DOIUrl":"https://doi.org/10.1145/3508375","url":null,"abstract":"The NP-Separate design methodology for very-large-scale integration (VLSI) design fine-controls the sizes of transistors, thereby achieving significant power, performance, and area improvement compared to the conventional standard-cell-based design methodology. NP-Separate uses NP cells formed by merging and routing N and P cells having only NFETs and PFETs, respectively. The NP cell formation, however, should be automated to design large circuits using the NP-Separate design methodology. In this paper, we propose design automation algorithms to create NP cells automatically. Simulation results show that the automated NP-Separate reduces the design time significantly, decreases the coupling capacitance by 13%, the critical path delay by 6%, and the power consumption by 10% on average compared to the manual NP-Separate designs. We also propose a detailed placement algorithm to generate more compact VLSI layouts with a little wirelength overhead. The combined effect reduces the coupling capacitance by 10%, the critical path delay by 5%, and the power consumption by 10% on average compared to the manual NP-Separate designs.","PeriodicalId":6933,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems (TODAES)","volume":"31 1","pages":"1 - 20"},"PeriodicalIF":0.0,"publicationDate":"2022-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82805407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}