{"title":"A “New Ara” for Vector Computing: An Open Source Highly Efficient RISC-V V 1.0 Vector Processor Design","authors":"Matteo Perotti, Matheus A. Cavalcante, Nils Wistoff, Renzo Andri, L. Cavigelli, L. Benini","doi":"10.1109/ASAP54787.2022.00017","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00017","url":null,"abstract":"Vector architectures are gaining traction for highly efficient processing of data-parallel workloads, driven by all major ISAs (RISC-V, Arm, Intel), and boosted by landmark chips, like the Arm SVE-based Fujitsu A64FX, powering the TOP500 leader Fugaku. The RISC-V V extension has recently reached 1.0-Frozen status. Here, we present its first open-source implementation, discuss the new specification's impact on the micro-architecture of a lane-based design, and provide insights on performance-oriented design of coupled scalar-vector processors. Our system achieves comparable/better PPA than state-of-the-art vector engines that implement older RVV versions: 15% better area, 6% improved throughput, and FPU utilization >98.5% on crucial kernels.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130134244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mask-Net: A Hardware-efficient Object Detection Network with Masked Region Proposals","authors":"Han-Chen Chen, Cong Hao","doi":"10.1109/ASAP54787.2022.00030","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00030","url":null,"abstract":"Object detection on embedded systems is challenging because it is hard to achieve real-time inference with low energy consumption and limited hardware resources. Another challenge is to find hardware-friendly methods to avoid redundant computation. To address these challenges, in this work, we propose Mask-Net, a hardware-efficient object detection network with masked region proposals in regular shapes. First, we propose a hardware-friendly region proposal method that avoids redundant computation as much and as early as possible, with slight or no accuracy loss. Second, we demonstrate that our method is generalizable by applying it to several detection backbones, including SkyNet, ResNet-18 and UltraNet. Our method performs well in different scenarios, including the DAC-SDC, UAV123 and OTB100 datasets. We choose SkyNet as our base model to design an accelerator and verify our design on a Xilinx ZCU106 FPGA. We observe a speedup of 1.3× and about 30% lower energy consumption when the FPGA runs at frequencies from 124 MHz to 214 MHz, with only a slight accuracy loss. We also conduct a design space exploration and demonstrate that our accelerator can achieve a theoretical speedup of 1.76× with masked region proposals. This is achieved by optimally allocating DSPs to different parts of the accelerator to balance the computations before and after the mask.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116932195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IMEC: A Memory-Efficient Convolution Algorithm For Quantised Neural Network Accelerators","authors":"Eashan Wadhwa, Shashwat Khandelwal, Shanker Shreejith","doi":"10.1109/ASAP54787.2022.00027","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00027","url":null,"abstract":"Quantised convolutional neural networks (QCNNs) on FPGAs have shown tremendous potential for deploying deep learning on resource-constrained devices closer to the data source or in embedded applications. An essential building block of (Q)CNNs is the convolutional layer. FPGA implementations use modified versions of convolution kernels to reduce resource overheads using variations of the sliding-kernel algorithm. While these alleviate resource consumption to a certain degree, they still incur considerable (distributed) memory resources, requiring larger FPGA devices with sufficient on-chip memory elements to implement deep QCNNs. In this paper, we present the Inverse Memory Efficient Convolution (IMEC) algorithm, a novel strategy to lower the memory consumption of convolutional layers in QCNNs. IMEC lowers the footprint of the intermediate matrix buffers incurred within the convolutional layers and the multiply-accumulate (MAC) operators required at each layer through a series of data-organisation and computational optimisations. We evaluate IMEC by integrating it into the BNN-PYNQ framework, which compiles high-level QCNN representations to an FPGA bitstream. Our results show that IMEC can optimise the memory footprint and the overall resource overhead of the convolutional layers by ~33% and ~20% (LUT and FF count) respectively, across multiple quantisation levels (1-bit to 8-bit), while maintaining inference accuracy identical to state-of-the-art QCNN implementations.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122361073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
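As background on where such intermediate buffers come from (a hedged illustration only, not the IMEC algorithm itself, whose data-organisation optimisations are described in the paper): a GEMM-based convolutional layer materialises a K·K-times-expanded im2col matrix, whereas a direct sliding-window convolution keeps only the output buffer.

```python
import numpy as np

def direct_conv2d(x, w):
    """Direct 2-D convolution, valid padding: x is (H, W), w is (K, K).

    Avoids the O(K*K*H*W) im2col buffer a GEMM-based layer would hold;
    only the (H-K+1, W-K+1) output is materialised.
    """
    H, W = x.shape
    K = w.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # one MAC reduction per output element, over a K x K window
            out[i, j] = np.sum(x[i:i + K, j:j + K] * w)
    return out
```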
{"title":"High-Performance AKAZE Implementation Including Parametrizable and Generic HLS Modules","authors":"Matthias Nickel, Lester Kalms, Tim Häring, D. Göhringer","doi":"10.1109/ASAP54787.2022.00031","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00031","url":null,"abstract":"The amount of image data to be processed has increased tremendously over the last decades. One major computer vision task is the extraction of information to find patterns in and between images. One well-studied pattern recognition algorithm is AKAZE, which builds a nonlinear scale space to detect features. While more efficient than its predecessor KAZE, the computational demands of AKAZE are still high. Since many real-world computer vision applications require fast computations, sometimes under hard power and time constraints, FPGAs have become a focus as a suitable target platform. This work presents a highly modularized and parameterizable implementation of the AKAZE feature detection algorithm integrated into HiFlipVX, a High-Level Synthesis library based on the OpenVX standard. The fine-granular modularization and the generic design of the implemented functions allow them to be easily reused, improving the workflow for other computer vision algorithms. The high degree of parameterization and the extension of the library also enable a fast and extensive exploration of the design space. The proposed design achieved high repeatability and a frame rate of up to 480 frames per second for an image resolution of 1920×1080, comparing favourably with related work.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128253849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Special Session on European Acceleration Technologies","authors":"","doi":"10.1109/asap54787.2022.00011","DOIUrl":"https://doi.org/10.1109/asap54787.2022.00011","url":null,"abstract":"","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125497808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-precision logarithmic arithmetic for neural network accelerators","authors":"Maxime Christ, F. D. Dinechin, F. Pétrot","doi":"10.1109/ASAP54787.2022.00021","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00021","url":null,"abstract":"Resource requirements for hardware acceleration of neural network inference are notoriously high, both in terms of computation and storage. One way to mitigate this issue is to quantize parameters and activations. This is usually done by scaling and centering the distributions of weights and activations, on a kernel-per-kernel basis, so that a low-precision binary integer representation can be used. This work studies the low-precision logarithmic number system (LNS) as an efficient alternative. Firstly, LNS has a wider dynamic range than fixed-point for the same number of bits. Thus, when quantizing MNIST and CIFAR reference networks without retraining, the smallest format size achieving top-1 accuracy comparable to floating-point is 1 to 3 bits smaller with LNS than with fixed-point. In addition, it is shown that the zero bit of classical LNS is not needed in this context, and that the sign bit can be saved for activations. The proposed LNS neuron is detailed and its implementation on FPGA is shown to be smaller and faster than a fixed-point one for comparable accuracy. Secondly, low-precision LNS enables efficient inference architectures where 1/ multiplications reduce to additions; 2/ the weighted inputs are converted to the classical linear domain, but the tables needed for this conversion remain very small thanks to the low precision; and 3/ the conversion of the output activation back to LNS can be merged with an arbitrary activation function.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125695746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
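The core LNS property this abstract relies on, multiplication turning into addition of log codes, can be sketched in a few lines; the 4-fractional-bit quantisation and the function names below are illustrative, not the paper's actual format.

```python
import math

def to_lns(x, frac_bits=4):
    """Quantise a positive real to a fixed-point base-2 logarithm code."""
    scale = 1 << frac_bits
    return round(math.log2(x) * scale)

def lns_mul(a_log, b_log):
    """In LNS, multiplication is just integer addition of the log codes."""
    return a_log + b_log

def from_lns(x_log, frac_bits=4):
    """Convert a log code back to the linear domain."""
    scale = 1 << frac_bits
    return 2.0 ** (x_log / scale)

# 3 * 5 computed with one integer addition, up to quantisation error
product = from_lns(lns_mul(to_lns(3.0), to_lns(5.0)))
```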
{"title":"Fast Heterogeneous Task Mapping for Reducing Edge DNN Latency","authors":"Murray L. Kornelsen, S. H. Mozafari, J. Clark, B. Meyer, W. Gross","doi":"10.1109/ASAP54787.2022.00020","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00020","url":null,"abstract":"To meet DNN inference latency constraints on resource-constrained edge devices, we employ heterogeneous computing, utilizing multiple processing elements (e.g. CPU + GPU) to accelerate inference. This leads to the challenge of efficiently mapping DNN operations to heterogeneous processing elements. For this task, we introduce a novel genetic algorithm (GA) optimizer. Through intelligent initialization and a customized mutation operation, we are able to evaluate 20x fewer generations while finding superior configurations compared with a baseline GA. Using our mapping optimizer, we find device placement configurations that achieve 15%, 24%, and 31% inference speed-up for BERT, SqueezeBERT, and InceptionV3, respectively.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127723836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
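A minimal sketch of the two GA ingredients named above, intelligent initialisation and a customised mutation, applied to a device-placement chromosome (one device per DNN op). The device list, mutation rate and seeding heuristic are hypothetical, and the fitness function (measured inference latency) is omitted.

```python
import random

DEVICES = ["cpu", "gpu"]  # hypothetical processing elements

def init_population(n_ops, pop_size, seed_mapping=None):
    """Intelligent initialisation: seed the population with a heuristic
    mapping (e.g. all ops on one device) instead of pure random noise."""
    pop = []
    if seed_mapping is not None:
        pop.append(list(seed_mapping))
    while len(pop) < pop_size:
        pop.append([random.choice(DEVICES) for _ in range(n_ops)])
    return pop

def mutate(mapping, rate=0.1):
    """Customised mutation: reassign a small fraction of ops to a
    (possibly different) device, keeping the rest of the mapping intact."""
    return [random.choice(DEVICES) if random.random() < rate else d
            for d in mapping]
```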
{"title":"Aggressive Performance Improvement on Processing-in-Memory Devices by Adopting Hugepages","authors":"P. C. Santos, Bruno E. Forlin, M. Alves, L. Carro","doi":"10.1109/ASAP54787.2022.00019","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00019","url":null,"abstract":"Processing-in-Memory (PIM) devices integrated into general-purpose systems demand virtual memory support. In this way, these devices can be seamlessly coupled to the software stack, while maintaining compatibility and security provided by address management via the Operating System (OS) without requiring disruptive programming efforts. Typically, PIM intends to access large volumes of data via vector operations, and thus can suffer severe penalties due to the high cost of page misses in the Translation Look-aside Buffer (TLB). Our study demonstrates the criticality of such penalties on the system's performance and that PIM must resort to large page sizes. The presented results exploit the native large pages available on the host, and they show substantial performance improvements (84×) for wide-vector PIM operations with large pages.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126843147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
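How an application can ask the host for large pages can be sketched as follows; this is a generic Linux sketch, not the paper's PIM setup. MADV_HUGEPAGE requests transparent huge pages and is advisory and Linux-only, so the code degrades gracefully elsewhere.

```python
import mmap

def alloc_vector_buffer(n_bytes):
    """Back a large vector buffer with huge pages where available, so wide
    vector accesses touch far fewer TLB entries."""
    buf = mmap.mmap(-1, n_bytes)              # anonymous mapping
    if hasattr(mmap, "MADV_HUGEPAGE"):        # Linux, Python >= 3.8
        buf.madvise(mmap.MADV_HUGEPAGE)       # advisory huge-page request
    return buf

buf = alloc_vector_buffer(64 << 20)           # 64 MiB working set
buf[0:8] = b"\x01" * 8                        # touch the first page
```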
{"title":"Design Space Exploration for Memory-Oriented Approximate Computing Techniques","authors":"Hugo Miomandre, J. Nezan, D. Ménard","doi":"10.1109/ASAP54787.2022.00028","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00028","url":null,"abstract":"Modern digital systems are processing more and more data. This increase in memory requirements must match the processing capabilities and interconnections to avoid the memory wall. Approximate computing techniques exist to alleviate these requirements but usually require a thorough and tedious analysis of the processing pipeline. This paper presents an application-agnostic Design Space Exploration (DSE) of the buffer-sizing process to reduce the memory footprint of applications while guaranteeing an output quality above a defined threshold. The proposed DSE selects the appropriate bit-width and storage type for buffers to satisfy the constraint. We show in this paper that the proposed DSE reduces the memory footprint of the SqueezeNet CNN by 58.6% with identical Top-1 prediction accuracy, and the full SKA SDP pipeline by 39.7% without degradation, while only testing for a subset of the design space. The proposed DSE is fast enough to be integrated into the design stream of applications.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128145349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
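A hedged sketch of the greedy flavour of such a buffer-sizing DSE: try narrower bit-widths per buffer and keep a change only while the measured output quality stays above the threshold. The quality callback and search order are assumptions for illustration, not the paper's exact exploration strategy.

```python
def size_buffers(buffers, widths, evaluate_quality, threshold):
    """buffers: list of buffer names; widths: candidate bit-widths,
    widest first; evaluate_quality: application-supplied callback that
    scores a {buffer: width} configuration."""
    config = {b: widths[0] for b in buffers}   # start at full precision
    for b in buffers:
        for w in widths[1:]:                   # try narrower widths
            trial = dict(config, **{b: w})
            if evaluate_quality(trial) >= threshold:
                config = trial                 # keep the smaller width
            else:
                break                          # narrower widths fail too
    return config
```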
{"title":"Secure Communication Protocol for Network-on-Chip with Authenticated Encryption and Recovery Mechanism","authors":"Julian Haase, Sebastian Jaster, Elke Franz, D. Göhringer","doi":"10.1109/ASAP54787.2022.00033","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00033","url":null,"abstract":"In recent times, Network-on-Chip (NoC) has become the state of the art for communication in Multiprocessor System-on-Chip due to the existing scalability issues in this area. However, these systems are exposed to security threats such as extraction of secret information. Therefore, the need for secure communication arises in such environments. In this work, we present a communication protocol based on authenticated encryption with recovery mechanisms to establish secure end-to-end communication between the NoC nodes. In addition, a selected key agreement approach required for secure communication is implemented. The security functionality is located in the network adapter of each processing element. If data is tampered with or deleted during transmission, recovery mechanisms ensure that the corrupted data is retransmitted by the network adapter without requiring intervention from the processing element. We simulated and implemented the complete system with SystemC TLM using the NoC simulation platform PANACA. Our results show that we can keep a high rate of correctly transmitted information even when attackers have infiltrated the NoC system.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"149 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116897229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
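The verify-and-retransmit idea can be sketched with a keyed MAC standing in for full authenticated encryption; the shared key, packet layout and function names are illustrative, not the paper's protocol, and the key would come from the key agreement step the abstract mentions.

```python
import hmac
import hashlib

KEY = b"shared-session-key"  # hypothetical; established by key agreement

def send(payload: bytes) -> bytes:
    """Append a 32-byte authentication tag to the outgoing packet."""
    return payload + hmac.new(KEY, payload, hashlib.sha256).digest()

def receive(packet: bytes):
    """Verify the tag; on failure the network adapter would trigger a
    retransmission without involving the processing element."""
    payload, tag = packet[:-32], packet[-32:]
    expected = hmac.new(KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # corrupted in transit: request retransmission
    return payload
```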