2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP): Latest Publications

A “New Ara” for Vector Computing: An Open Source Highly Efficient RISC-V V 1.0 Vector Processor Design
Matteo Perotti, Matheus A. Cavalcante, Nils Wistoff, Renzo Andri, L. Cavigelli, L. Benini
{"title":"A “New Ara” for Vector Computing: An Open Source Highly Efficient RISC-V V 1.0 Vector Processor Design","authors":"Matteo Perotti, Matheus A. Cavalcante, Nils Wistoff, Renzo Andri, L. Cavigelli, L. Benini","doi":"10.1109/ASAP54787.2022.00017","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00017","url":null,"abstract":"Vector architectures are gaining traction for highly efficient processing of data-parallel workloads, driven by all major ISAs (RISC-V, Arm, Intel), and boosted by landmark chips, like the Arm SVE-based Fujitsu A64FX, powering the TOP500 leader Fugaku. The RISC-V V extension has recently reached 1.0-Frozen status. Here, we present its first open-source implementation, discuss the new specification's impact on the micro-architecture of a lane-based design, and provide insights on performance-oriented design of coupled scalar-vector processors. Our system achieves comparable/better PPA than state-of-the-art vector engines that implement older RVV versions: 15% better area, 6% improved throughput, and FPU utilization >98.5% on crucial kernels.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130134244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
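As a rough illustration of the execution model behind RVV 1.0 designs like Ara, the Python sketch below emulates vector-length-agnostic strip-mining: a vsetvl-style call lets the same loop process any array on any hardware vector length. The element count VLEN_ELEMS and all names are illustrative assumptions, not details of the Ara micro-architecture.

```python
# Toy model of RISC-V V (RVV) strip-mining: the hardware picks how many
# elements (vl) it processes per iteration via vsetvl, so one loop works
# for any vector register length. Hypothetical illustration only.

VLEN_ELEMS = 8  # assumed hardware vector length, in 32-bit elements

def vsetvl(avl: int) -> int:
    """Return how many of the remaining `avl` elements this pass handles."""
    return min(avl, VLEN_ELEMS)

def vec_add(a: list[int], b: list[int]) -> list[int]:
    out = [0] * len(a)
    i, remaining = 0, len(a)
    while remaining > 0:
        vl = vsetvl(remaining)       # cf. vsetvli t0, a0, e32, m1
        for j in range(vl):          # cf. vle32 / vadd.vv / vse32 on one group
            out[i + j] = a[i + j] + b[i + j]
        i += vl
        remaining -= vl
    return out

assert vec_add(list(range(20)), list(range(20))) == [2 * x for x in range(20)]
```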
Mask-Net: A Hardware-efficient Object Detection Network with Masked Region Proposals
Han-Chen Chen, Cong Hao
{"title":"Mask-Net: A Hardware-efficient Object Detection Network with Masked Region Proposals","authors":"Han-Chen Chen, Cong Hao","doi":"10.1109/ASAP54787.2022.00030","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00030","url":null,"abstract":"Object detection on embedded systems is challenging because it is hard to achieve real-time inference with low energy consumption and limited hardware resources. Another challenge is to find hardware-friendly methods to avoid redundant computation. To address these challenges, in this work, we propose Mask-Net, a hardware-efficient object detection network with masked region proposals in regular shapes. First, we propose a hardware-friendly region proposal method to avoid redundant computation as much as possible and as early as possible, with slight or no accuracy loss. Second, we demonstrate that our method is generalizable by applying it to several detection backbones including SkyNet, ResNet-18 and UltraNet. Our method performs well in different scenarios, including DAC-SDC dataset, UAV123 dataset and OTB100 dataset. We choose SkyNet as our base model to design an accelerator and verify our design on Xilinx ZCU106 FPGA. We observe a speedup of 1.3× and about 30% energy consumption reduction when the FPGA runs at different frequencies from 124 MHz to 214 MHz with only a slight accuracy loss. We also conduct a design space exploration and demonstrate that our accelerator can achieve a theoretical speedup of 1.76× with masked region proposals. This is achieved by optimally allocating DSPs to different parts of the accelerator to balance the computations before and after the mask.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116932195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
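To make the masked-region idea concrete, here is a minimal NumPy sketch that restricts convolution work to a regular (rectangular) proposal, skipping MACs outside it. The function name, shapes, and bounding-box interface are illustrative assumptions, not the paper's design.

```python
# Hedged sketch: compute a 3x3 'valid' convolution only inside a regular
# rectangular region proposal, skipping all work outside the mask.
import numpy as np

def masked_conv2d(x: np.ndarray, k: np.ndarray,
                  bbox: tuple[int, int, int, int]) -> np.ndarray:
    """bbox = (r0, r1, c0, c1) bounds the output rows/cols that get computed."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2), dtype=x.dtype)
    r0, r1, c0, c1 = bbox
    for r in range(max(r0, 0), min(r1, H - 2)):
        for c in range(max(c0, 0), min(c1, W - 2)):
            out[r, c] = np.sum(x[r:r + 3, c:c + 3] * k)  # MACs only where masked in
    return out

x = np.random.rand(32, 32).astype(np.float32)
k = np.ones((3, 3), dtype=np.float32) / 9.0
y = masked_conv2d(x, k, (8, 24, 8, 24))  # compute restricted to the proposal
```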
IMEC: A Memory-Efficient Convolution Algorithm For Quantised Neural Network Accelerators
Eashan Wadhwa, Shashwat Khandelwal, Shanker Shreejith
{"title":"IMEC: A Memory-Efficient Convolution Algorithm For Quantised Neural Network Accelerators","authors":"Eashan Wadhwa, Shashwat Khandelwal, Shanker Shreejith","doi":"10.1109/ASAP54787.2022.00027","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00027","url":null,"abstract":"Quantised convolution neural networks (QCNNs) on FPGAs have shown tremendous potential for deploying deep learning on resource constrained devices closer to the data source or in embedded applications. An essential building block of (Q)CNNs are the convolutional layers. FPGA implementations use modified versions of convolution kernels to reduce the resource overheads using variations of the sliding kernel algorithm. While these alleviate resource consumption to a certain degree, they still incur considerable (distributed) memory resources, requiring the use of larger FPGA devices with sufficient on-chip memory elements to implement deep QCNNs. In this paper, we present the Inverse Memory Efficient Convolution (IMEC) algorithm, a novel strategy to lower the memory consumption of convolutional layers in QCNNs. IMEC lowers the footprint of intermediate matrix buffers incurred within the convolutional layers and the multiply-accumulate (MAC) operators required at each layer through a series of data organisation and computational optimisations. We evaluate IMEC by integrating it into the BNN-PYNQ framework that can compile high-level QCNN representations to the FPGA bitstream. Our results show that IMEC can optimise memory footprint and the overall resource overhead of the convolutional layers by ~33% and ~20% (LUT and FF count) respectively, across multiple quantisation levels (1-bit to 8-bit), while maintaining identical inference accuracy as the state-of-the-art QCNN implementations.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122361073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
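For intuition on why intermediate convolution buffers dominate memory, the sketch below compares the buffer size of classic convolution lowering (im2col) against a streaming line buffer for a K×K kernel. The formulas are standard back-of-envelope arithmetic, not IMEC's actual buffer accounting.

```python
# Rough buffer-size comparison for a KxK convolution over an HxWxC input,
# illustrating the memory pressure IMEC-style reorganisation targets.

def im2col_buffer_elems(H: int, W: int, C: int, K: int) -> int:
    # im2col materialises one KxKxC patch per output pixel.
    return (H - K + 1) * (W - K + 1) * K * K * C

def line_buffer_elems(W: int, C: int, K: int) -> int:
    # A streaming sliding-window kernel keeps only K-1 rows plus one window.
    return (K - 1) * W * C + K * K * C

H = W = 224; C = 64; K = 3
print(im2col_buffer_elems(H, W, C, K))  # 28,387,584 elements (~28.4M)
print(line_buffer_elems(W, C, K))       # 29,248 elements (~29K)
```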
High-Performance AKAZE Implementation Including Parametrizable and Generic HLS Modules
Matthias Nickel, Lester Kalms, Tim Häring, D. Göhringer
{"title":"High-Performance AKAZE Implementation Including Parametrizable and Generic HLS Modules","authors":"Matthias Nickel, Lester Kalms, Tim Häring, D. Göhringer","doi":"10.1109/ASAP54787.2022.00031","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00031","url":null,"abstract":"The amount of image data to be processed has increased tremendously over the last decades. One major computer vision task is the extraction of information to find patterns in and between images. One well-studied pattern recognition algorithm is AKAZE which builds a nonlinear scale space to detect features. While being more efficient compared to its predecessor KAZE, the computational demands of AKAZE are still high. Since many real-world computer vision applications require fast computations, sometimes under hard power and time constraints, FPGAs became a focus as a suitable target platform. This work presents a highly modularized and parameterizable implementation of the AKAZE feature detection algorithm integrated into HiFlipVX, which is a High-Level Synthesis library based on the OpenVX standard. The fine granular modularization and the generic design of the implemented functions allows them to be easily reused, increasing the workflow for other computer vision algorithms. The high degree of parameterization and extension of the library enables also a fast and extensive exploration of the design space. The proposed design achieved a high repeatability and frame rate of up to 480 frames per second for an image resolution of 1920×1080 compared to related work.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128253849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
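AKAZE's nonlinear scale space smooths within image regions while preserving edges. Below is a minimal NumPy sketch of one explicit diffusion step with a Perona-Malik-style conductivity; the real AKAZE pipeline uses Fast Explicit Diffusion and, in this paper, streaming HLS modules, so the step size and conductivity form here are illustrative assumptions only.

```python
# One explicit step of edge-preserving nonlinear diffusion (g2 conductivity),
# the kind of update AKAZE iterates to build its scale space. Sketch only.
import numpy as np

def diffusion_step(L: np.ndarray, k: float, tau: float = 0.2) -> np.ndarray:
    gy, gx = np.gradient(L)                      # image gradients (rows, cols)
    g = 1.0 / (1.0 + (gx**2 + gy**2) / k**2)     # conductivity: small at edges
    div = np.gradient(g * gy, axis=0) + np.gradient(g * gx, axis=1)
    return L + tau * div                         # explicit Euler update

img = np.random.rand(64, 64).astype(np.float64)
smoothed = diffusion_step(img, k=0.1)            # edges diffuse less than flats
```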
Special Session on European Acceleration Technologies
{"title":"Special Session on European Acceleration Technologies","authors":"","doi":"10.1109/asap54787.2022.00011","DOIUrl":"https://doi.org/10.1109/asap54787.2022.00011","url":null,"abstract":"","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125497808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Low-precision logarithmic arithmetic for neural network accelerators
Maxime Christ, F. D. Dinechin, F. Pétrot
{"title":"Low-precision logarithmic arithmetic for neural network accelerators","authors":"Maxime Christ, F. D. Dinechin, F. Pétrot","doi":"10.1109/ASAP54787.2022.00021","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00021","url":null,"abstract":"Resource requirements for hardware acceleration of neural networks inference is notoriously high, both in terms of computation and storage. One way to mitigate this issue is to quantize parameters and activations. This is usually done by scaling and centering the distributions of weights and activations, on a kernel per kernel basis, so that a low-precision binary integer representation can be used. This work studies low-precision logarithmic number system (LNS) as an efficient alternative. Firstly, LNS has more dynamic than fixed-point for the same number of bits. Thus, when quantizing MNIST and CIFAR reference networks without retraining, the smallest format size achieving top-1 accuracy comparable to floating-point is 1 to 3 bits smaller with LNS than with fixed-point. In addition, it is shown that the zero bit of classical LNS is not needed in this context, and that the sign bit can be saved for activations. The proposed LNS neuron is detailed and its implementation on FPGA is shown to be smaller and faster than a fixed-point one for comparable accuracy. Secondly, low-precision LNS enables efficient inference architectures where 1 / multiplications reduce to additions; 2/ the weighted inputs are converted to classical linear domain, but the tables needed for this conversion remain very small thanks to the low precision; and 3/ the conversion of the output activation back to LNS can be merged with an arbitrary activation function.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125695746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
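The core LNS trick (multiplication becomes addition of log-domain exponents) is easy to demonstrate. The sketch below encodes values as a sign plus a fixed-point log2 magnitude; the bit-width, rounding, and encoding are illustrative assumptions, not the paper's format (which, e.g., drops the zero bit and the activation sign bit).

```python
# Hedged LNS sketch: store (sign, quantised log2|x|); multiply = XOR signs
# and add exponents. Zero is not handled here (the paper treats it separately).
import math

FRAC_BITS = 2  # fractional bits of the log2 exponent (assumed)

def to_lns(x: float) -> tuple[int, int]:
    e = round(math.log2(abs(x)) * (1 << FRAC_BITS))
    return (1 if x < 0 else 0, e)

def lns_mul(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    return (a[0] ^ b[0], a[1] + b[1])   # the multiplier is just an adder

def from_lns(v: tuple[int, int]) -> float:
    return (-1.0) ** v[0] * 2.0 ** (v[1] / (1 << FRAC_BITS))

x, w = 0.75, -1.5
approx = from_lns(lns_mul(to_lns(x), to_lns(w)))
print(approx, x * w)   # close to -1.125, up to quantisation error
```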
Fast Heterogeneous Task Mapping for Reducing Edge DNN Latency
Murray L. Kornelsen, S. H. Mozafari, J. Clark, B. Meyer, W. Gross
{"title":"Fast Heterogeneous Task Mapping for Reducing Edge DNN Latency","authors":"Murray L. Kornelsen, S. H. Mozafari, J. Clark, B. Meyer, W. Gross","doi":"10.1109/ASAP54787.2022.00020","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00020","url":null,"abstract":"To meet DNN inference latency constraints on resource-constrained edge devices, we employ heterogeneous computing, utilizing multiple processing elements (e.g. CPU + GPU) accelerate inference. This leads to the challenge of efficiently mapping DNN operations to heterogeneous processing elements. For this task, we introduce a novel genetic algorithm (GA) optimizer. Through intelligent initialization and a customized mutation operation, we are able to evaluate 20x fewer generations while finding superior configurations compared with a baseline GA. Using our mapping optimizer, we find device placement configurations that achieve 15%, 24%, and 31% inference speed-up for BERT, SqueezeBERT, and InceptionV3,respectively.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127723836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
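A minimal GA for device placement looks roughly like the sketch below: a mapping is a list assigning each operator to a device, fitness is modeled latency, and mutation flips one assignment. The latency model, operator count, and GA parameters are hypothetical stand-ins; the paper's intelligent initialization and customized mutation are only approximated.

```python
# Toy genetic algorithm for mapping DNN ops onto heterogeneous devices.
import random

OPS = 12                       # number of DNN operators (assumed)
DEVICES = ["cpu", "gpu"]
COST = {"cpu": 3.0, "gpu": 1.0}  # per-op latency model (assumed)
TRANSFER = 0.8                 # penalty when consecutive ops switch devices

def latency(mapping: list[str]) -> float:
    t = sum(COST[d] for d in mapping)
    t += TRANSFER * sum(a != b for a, b in zip(mapping, mapping[1:]))
    return t

def mutate(mapping: list[str]) -> list[str]:
    m = mapping[:]
    m[random.randrange(OPS)] = random.choice(DEVICES)
    return m

pop = [[random.choice(DEVICES) for _ in range(OPS)] for _ in range(32)]
for _ in range(50):                              # generations
    pop.sort(key=latency)                        # elitist selection
    pop = pop[:8] + [mutate(random.choice(pop[:8])) for _ in range(24)]
print(min(map(latency, pop)))                    # best mapping's latency
```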
Aggressive Performance Improvement on Processing-in-Memory Devices by Adopting Hugepages
P. C. Santos, Bruno E. Forlin, M. Alves, L. Carro
{"title":"Aggressive Performance Improvement on Processing-in-Memory Devices by Adopting Hugepages","authors":"P. C. Santos, Bruno E. Forlin, M. Alves, L. Carro","doi":"10.1109/ASAP54787.2022.00019","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00019","url":null,"abstract":"Processing-in-Memory (PIM) devices integrated into general-purpose systems demand virtual memory support. In this way, these devices can be seamlessly coupled to the software stack, while maintaining compatibility and security provided by address management via the Operating System (OS) without requiring disruptive programming efforts. Typically, PIM intends to access large volumes of data via vector operations, and thus can suffer severe penalties due to the high cost of page misses in the Translation Look-aside Buffer (TLB). Our study demonstrates the criticality of such penalties on the system's performance and that PIM must resort to large page sizes. The presented results exploit the native large pages available on the host, and they show substantial performance improvements $(84times)$ for wide-vector PIM operations with large pages.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126843147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
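The TLB pressure argument follows from simple arithmetic: TLB reach is entry count times page size, so a wide-vector kernel streaming a large footprint through 4 KiB pages misses constantly, while 2 MiB hugepages cover the same footprint with orders of magnitude fewer translations. The entry count and footprint below are illustrative, not measured values from the paper.

```python
# Back-of-envelope TLB reach: why PIM vector kernels want hugepages.

TLB_ENTRIES = 64                           # assumed TLB size
footprint = 512 * 2**20                    # 512 MiB streamed by a PIM kernel

for page in (4 * 2**10, 2 * 2**20):        # 4 KiB vs 2 MiB pages
    reach = TLB_ENTRIES * page             # bytes covered without a miss
    misses = footprint // page             # cold TLB misses to touch it all
    print(f"page={page:>9}B reach={reach:>11}B cold_misses={misses}")
# 4 KiB: 256 KiB reach, 131072 misses; 2 MiB: 128 MiB reach, 256 misses.
```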
Design Space Exploration for Memory-Oriented Approximate Computing Techniques
Hugo Miomandre, J. Nezan, D. Ménard
{"title":"Design Space Exploration for Memory-Oriented Approximate Computing Techniques","authors":"Hugo Miomandre, J. Nezan, D. Ménard","doi":"10.1109/ASAP54787.2022.00028","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00028","url":null,"abstract":"Modern digital systems are processing more and more data. This increase in memory requirements must match the processing capabilities and interconnections to avoid the memory wall. Approximate computing techniques exist to alleviate these requirements but usually require a thorough and tedious analysis of the processing pipeline. This paper presents an application-agnostic Design Space Exploration (DSE) of the buffer-sizing process to reduce the memory footprint of applications while guaranteeing an output quality above a defined threshold. The proposed DSE selects the appropriate bit-width and storage type for buffers to satisfy the constraint. We show in this paper that the proposed DSE reduces the memory footprint of the SqueezeNet CNN by 58.6% with identical Top-1 prediction accuracy, and the full SKA SDP pipeline by 39.7% without degradation, while only testing for a subset of the design space. The proposed DSE is fast enough to be integrated into the design stream of applications.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128145349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
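A simple form of quality-constrained buffer sizing is a greedy search: shrink each buffer's bit-width until the output quality metric would drop below the threshold, then back off one step. The sketch below shows that loop; the evaluate() hook, buffer set, and toy quality model are placeholders, and the paper's DSE additionally chooses storage types and prunes the space rather than sweeping it.

```python
# Hedged sketch of quality-constrained buffer-sizing DSE.

def dse(buffers: dict[str, int], evaluate, threshold: float) -> dict[str, int]:
    """buffers maps name -> bit-width; evaluate(cfg) returns output quality."""
    cfg = dict(buffers)
    for name in cfg:
        while cfg[name] > 1:
            trial = dict(cfg, **{name: cfg[name] - 1})
            if evaluate(trial) < threshold:
                break                    # one bit too far; keep last good width
            cfg = trial
    return cfg

# Toy quality model: accuracy collapses once any buffer drops below 6 bits.
quality = lambda cfg: 1.0 if min(cfg.values()) >= 6 else 0.5
print(dse({"conv1": 16, "conv2": 16, "fc": 16}, quality, threshold=0.9))
# -> {'conv1': 6, 'conv2': 6, 'fc': 6}
```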
Secure Communication Protocol for Network-on-Chip with Authenticated Encryption and Recovery Mechanism
Julian Haase, Sebastian Jaster, Elke Franz, D. Göhringer
{"title":"Secure Communication Protocol for Network-on-Chip with Authenticated Encryption and Recovery Mechanism","authors":"Julian Haase, Sebastian Jaster, Elke Franz, D. Göhringer","doi":"10.1109/ASAP54787.2022.00033","DOIUrl":"https://doi.org/10.1109/ASAP54787.2022.00033","url":null,"abstract":"In recent times, Network-on-Chip (NoC) has become state of the art for communication in Multiprocessor System-on-Chip due to the existing scalability issues in this area. However, these systems are exposed to security threats such as extraction of secret information. Therefore, the need for secure communication arises in such environments. In this work, we present a communication protocol based on authenticated encryption with recovery mechanisms to establish secure end-to-end communication between the NoC nodes. In addition, a selected key agreement approach required for secure communication is implemented. The security functionality is located in the network adapter of each processing element. If data is tampered with or deleted during transmission, recovery mechanisms ensure that the corrupted data is retransmitted by the network adapter without the need of interference from the processing element. We simulated and implemented the complete system with SystemC TLM using the NoC simulation platform PANACA. Our results show that we can keep a high rate of correctly transmitted information even when attackers infiltrated the NoC system.","PeriodicalId":207871,"journal":{"name":"2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"149 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116897229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
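The protocol's two ingredients, authenticated encryption and tamper-triggered retransmission, can be modeled in a few lines. The sketch below uses AES-GCM from the pyca/cryptography package with a sequence number as nonce and a failed tag check standing in for the NACK-and-retransmit path; the cipher choice, packet framing, and key agreement here are assumptions, not the paper's design.

```python
# Toy model: per-packet authenticated encryption with NACK-driven recovery.
from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)      # shared via key agreement
tx, rx = AESGCM(key), AESGCM(key)

def send(seq: int, payload: bytes) -> bytes:
    nonce = seq.to_bytes(12, "big")            # unique nonce per packet
    return tx.encrypt(nonce, payload, b"hdr")  # ciphertext || auth tag

def receive(seq: int, packet: bytes) -> bytes | None:
    try:
        return rx.decrypt(seq.to_bytes(12, "big"), packet, b"hdr")
    except InvalidTag:
        return None                            # tampered: request retransmit

pkt = send(7, b"router config")
tampered = bytes([pkt[0] ^ 0xFF]) + pkt[1:]
assert receive(7, tampered) is None            # receiver NACKs the packet...
assert receive(7, pkt) == b"router config"     # ...and the retransmit succeeds
```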