{"title":"PIMap","authors":"Gai Liu, Zhiru Zhang","doi":"10.1145/3268344","DOIUrl":"https://doi.org/10.1145/3268344","url":null,"abstract":"Modern FPGA synthesis tools typically apply a predetermined sequence of logic optimizations on the input logic network before carrying out technology mapping. While the “known recipes” of logic transformations often lead to improved mapping results, there remains a nontrivial gap between the quality metrics driving the pre-mapping logic optimizations and those targeted by the actual technology mapping. Needless to mention, such miscorrelations would eventually result in suboptimal quality of results. In this article, we propose PIMap, which couples logic transformations and technology mapping under an iterative improvement framework for LUT-based FPGAs. In each iteration, PIMap randomly proposes a transformation on the given logic network from an ensemble of candidate optimizations; it then invokes technology mapping and makes use of the mapping result to determine the likelihood of accepting the proposed transformation. By adjusting the optimization objective and incorporating required time constraints during the iterative process, PIMap can flexibly optimize for different objectives including area minimization, delay optimization, and delay-constrained area reduction. To mitigate the runtime overhead, we further introduce parallelization techniques to decompose a large design into multiple smaller sub-netlists that can be optimized simultaneously. Experimental results show that PIMap achieves promising quality improvement over a set of commonly used benchmarks, including improving the majority of the best-known area and delay records for the EPFL benchmark suite.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114066243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Fine-grained Processor-logic Interactions on the Cache-coherent Zynq Platform","authors":"Alexander Kroh, O. Diessel","doi":"10.1145/3277506","DOIUrl":"https://doi.org/10.1145/3277506","url":null,"abstract":"The introduction of cache-coherent processor-logic interconnects in CPU-FPGA platforms promises low-latency communication between CPU and FPGA fabrics. This reduced latency improves the performance of heterogeneous systems implemented on such devices and gives rise to new software architectures that can better use the available hardware. Via an extended study accelerating the software task scheduler of a microkernel operating system, this article reports on the potential for accelerating applications that exhibit fine-grained interactions. In doing so, we evaluate the performance of direct and cache-coherent communication methods for applications that involve frequent, low-bandwidth transactions between CPU and programmable logic. In the specific case we studied, we found that replacing a highly optimised software implementation of the task scheduler with an FPGA-based scheduler reduces the cost of communication between two software threads by 5.5%. We also found that, while hardware acceleration reduces cache footprint, we still observe execution time variability because of other non-deterministic features of the CPU.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130423243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing","authors":"Yaman Umuroglu, Davide Conficconi, Lahiru Rasnayake, Thomas B. Preußer, Magnus Själander","doi":"10.1145/3337929","DOIUrl":"https://doi.org/10.1145/3337929","url":null,"abstract":"Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing, previously utilized the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an arithmetic architecture that better utilizes six-input LUTs. The improved BISMO achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a Xilinx UltraScale+ MPSoC.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128258066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Section on Deep Learning in FPGAs","authors":"Deming Chen, Andrew Putnam, S. Wilton","doi":"10.1145/3294768","DOIUrl":"https://doi.org/10.1145/3294768","url":null,"abstract":"The rapid advance of Deep Learning (DL), especially via Deep Neural Networks (DNNs), has been shown to compete with and even exceed human capabilities in tasks such as image recognition, playing complex games, and large-scale information retrieval. However, due to the high computational and power demands of deep neural networks, hardware accelerators are essential to ensure that the computation speed meets the application requirements. Field-programmable gate arrays (FPGAs) have demonstrated great strength in accelerating deep learning inference with high energy efficiency. To explore the strength of FPGA thoroughly and create a pool of advanced representative research works, we started a call for a special issue of TRETS with the topic of DL on FPGAs. The topics of interest include many different aspects of DL on FPGAs, including compilers, tools, and design methodologies, microarchitectures, cloud deployments, edge or IoT, DNN compression, security, comparison studies, survey studies, and others. Many people answered this call and submitted their most recent research results. After a subset of the submissions was desk-rejected for quality control purposes, a total of 23 manuscripts went through a full-blown reviewing process. To facilitate a fast, fair, and effective reviewing process for this special issue, we formed a special pool of reviewers who are experts on DL and FPGA topics. After a rigorous reviewing process, eight top-quality papers have been accepted into this special issue so far. The following list shows the title of the paper and the institute(s) of the authors, and highlights the contributions of each article.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115676446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing CNN-based Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA","authors":"Shuanglong Liu, Hongxiang Fan, Xinyu Niu, Ho-Cheung Ng, Yang Chu, W. Luk","doi":"10.1145/3242900","DOIUrl":"https://doi.org/10.1145/3242900","url":null,"abstract":"Convolutional Neural Networks-- (CNNs) based algorithms have been successful in solving image recognition problems, showing very large accuracy improvement. In recent years, deconvolution layers are widely used as key components in the state-of-the-art CNNs for end-to-end training and models to support tasks such as image segmentation and super resolution. However, the deconvolution algorithms are computationally intensive, which limits their applicability to real-time applications. Particularly, there has been little research on the efficient implementations of deconvolution algorithms on FPGA platforms that have been widely used to accelerate CNN algorithms by practitioners and researchers due to their high performance and power efficiency. In this work, we propose and develop deconvolution architecture for efficient FPGA implementation. FPGA-based accelerators are proposed for both deconvolution and CNN algorithms. Besides, memory sharing between the computation modules is proposed for the FPGA-based CNN accelerator as well as for other optimization techniques. A non-linear optimization model based on the performance model is introduced to efficiently explore the design space to achieve optimal processing speed of the system and improve power efficiency. Furthermore, a hardware mapping framework is developed to automatically generate the low-latency hardware design for any given CNN model on the target device. Finally, we implement our designs on Xilinx Zynq ZC706 board and the deconvolution accelerator achieves a performance of 90.1 giga operations per second (GOPS) under 200MHz working frequency and a performance density of 0.10 GOPS/DSP using 32-bit quantization, which significantly outperforms previous designs on FPGAs. A real-time application of scene segmentation on Cityscapes Dataset is used to evaluate our CNN accelerator on Zynq ZC706 board, and the system achieves a performance of 107 GOPS and 0.12 GOPS/DSP using 16-bit quantization and supports up to 17 frames per second for 512 × 512 image inputs with a power consumption of only 9.6W.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"258 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123288848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instruction Driven Cross-layer CNN Accelerator for Fast Detection on FPGA","authors":"Jincheng Yu, Guangjun Ge, Yiming Hu, Xuefei Ning, Jiantao Qiu, Kaiyuan Guo, Yu Wang, Huazhong Yang","doi":"10.1145/3283452","DOIUrl":"https://doi.org/10.1145/3283452","url":null,"abstract":"In recent years, Convolutional Neural Networks (CNNs) have been widely applied in computer vision and have achieved significant improvements in object detection tasks. Although there are many optimizing methods to speed up CNN-based detection algorithms, it is still difficult to deploy detection algorithms on real-time low-power systems. Field-Programmable Gate Array (FPGA) has been widely explored as a platform for accelerating CNN due to its promising performance, high energy efficiency, and flexibility. Previous works show that the energy consumption of CNN accelerators is dominated by the memory access. By fusing multiple layers in CNN, the intermediate data transfer can be reduced. However, previous accelerators with the cross-layer scheduling are designed for a particular CNN model. In addition to the memory access optimization, the Winograd algorithm can greatly improve the computational performance of convolution. In this article, to improve the flexibility of hardware, we design an instruction-driven CNN accelerator, supporting the Winograd algorithm and the cross-layer scheduling, for object detection. We modify the loop unrolling order of CNN, so that we can schedule a CNN across different layers with instructions and eliminate the intermediate data transfer. We propose a hardware architecture to support the instructions with Winograd computation units and reach the state-of-the-art energy efficiency. To deploy image detection algorithms onto the proposed accelerator with fixed-point computation units, we adopt the fixed-point fine-tune method, which can guarantee the accuracy of the detection algorithms. We evaluate our accelerator and scheduling policy on the Xilinx KU115 FPGA platform. The intermediate data transfer can be reduced by more than 90% on the VGG-D CNN model with the cross-layer strategy. Thus, the performance of our hardware accelerator reaches 1700GOP/s on the classification model VGG-D. We also implement a framework for object detection algorithms, which achieves 2.3× and 50× in energy efficiency compared with GPU and CPU, respectively. Compared with floating-point algorithms, the accuracy of the fixed-point detection algorithms only drops by less than 1%.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133540640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ReDCrypt","authors":"B. Rouhani, S. Hussain, K. Lauter, F. Koushanfar","doi":"10.1145/3242899","DOIUrl":"https://doi.org/10.1145/3242899","url":null,"abstract":"Artificial Intelligence (AI) is increasingly incorporated into the cloud business in order to improve the functionality (e.g., accuracy) of the service. The adoption of AI as a cloud service raises serious privacy concerns in applications where the risk of data leakage is not acceptable. Examples of such applications include scenarios where clients hold potentially sensitive private information such as medical records, financial data, and/or location. This article proposes ReDCrypt, the first reconfigurable hardware-accelerated framework that empowers privacy-preserving inference of deep learning models in cloud servers. ReDCrypt is well-suited for streaming (a.k.a., real-time AI) settings where clients need to dynamically analyze their data as it is collected over time without having to queue the samples to meet a certain batch size. Unlike prior work, ReDCrypt neither requires to change how AI models are trained nor relies on two non-colluding servers to perform. The privacy-preserving computation in ReDCrypt is executed using Yao’s Garbled Circuit (GC) protocol. We break down the deep learning inference task into two phases: (i) privacy-insensitive (local) computation, and (ii) privacy-sensitive (interactive) computation. We devise a high-throughput and power-efficient implementation of GC protocol on FPGA for the privacy-sensitive phase. ReDCrypt’s accompanying API provides support for seamless integration of ReDCrypt into any deep learning framework. Proof-of-concept evaluations for different DL applications demonstrate up to 57-fold higher throughput per core compared to the best prior solution with no drop in the accuracy.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125693595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"You Cannot Improve What You Do not Measure","authors":"Andrew Boutros, S. Yazdanshenas, Vaughn Betz","doi":"10.1145/3242898","DOIUrl":"https://doi.org/10.1145/3242898","url":null,"abstract":"Recently, deep learning (DL) has become best-in-class for numerous applications but at a high computational cost that necessitates high-performance energy-efficient acceleration. The reconfigurability of FPGAs is appealing due to the rapid change in DL models but also causes lower performance and area-efficiency compared to ASICs. In this article, we implement three state-of-the-art computing architectures (CAs) for convolutional neural network (CNN) inference on FPGAs and ASICs. By comparing the FPGA and ASIC implementations, we highlight the area and performance costs of programmability to pinpoint the inefficiencies in current FPGA architectures. We perform our experiments using three variations of these CAs for AlexNet, VGG-16 and ResNet-50 to allow extensive comparisons. We find that the performance gap varies significantly from 2.8× to 6.3×, while the area gap is consistent across CAs with an 8.7 average FPGA-to-ASIC area ratio. Among different blocks of the CAs, the convolution engine, constituting up to 60% of the total area, has a high area ratio ranging from 13 to 31. Motivated by our FPGA vs. ASIC comparisons, we suggest FPGA architectural changes such as increasing DSP block count, enhancing low-precision support in DSP blocks and rethinking the on-chip memories to reduce the programmability gap for DL applications.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130835449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lightening the Load with Highly Accurate Storage- and Energy-Efficient LightNNs","authors":"Ruizhou Ding, Z. Liu, R. D. Blanton, Diana Marculescu","doi":"10.1145/3270689","DOIUrl":"https://doi.org/10.1145/3270689","url":null,"abstract":"Hardware implementations of deep neural networks (DNNs) have been adopted in many systems because of their higher classification speed. However, while they may be characterized by better accuracy, larger DNNs require significant energy and area, thereby limiting their wide adoption. The energy consumption of DNNs is driven by both memory accesses and computation. Binarized neural networks (BNNs), as a tradeoff between accuracy and energy consumption, can achieve great energy reduction and have good accuracy for large DNNs due to their regularization effect. However, BNNs show poor accuracy when a smaller DNN configuration is adopted. In this article, we propose a new DNN architecture, LightNN, which replaces the multiplications to one shift or a constrained number of shifts and adds. Our theoretical analysis for LightNNs shows that their accuracy is maintained while dramatically reducing storage and energy requirements. For a fixed DNN configuration, LightNNs have better accuracy at a slight energy increase than BNNs, yet are more energy efficient with only slightly less accuracy than conventional DNNs. Therefore, LightNNs provide more options for hardware designers to trade off accuracy and energy. Moreover, for large DNN configurations, LightNNs have a regularization effect, making them better in accuracy than conventional DNNs. These conclusions are verified by experiment using the MNIST and CIFAR-10 datasets for different DNN configurations. Our FPGA implementation for conventional DNNs and LightNNs confirms all theoretical and simulation results and shows that LightNNs reduce latency and use fewer FPGA resources compared to conventional DNN architectures.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121075656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Efficiency Convolutional Ternary Neural Networks with Custom Adder Trees and Weight Compression","authors":"Adrien Prost-Boucle, A. Bourge, F. Pétrot","doi":"10.1145/3270764","DOIUrl":"https://doi.org/10.1145/3270764","url":null,"abstract":"Although performing inference with artificial neural networks (ANN) was until quite recently considered as essentially compute intensive, the emergence of deep neural networks coupled with the evolution of the integration technology transformed inference into a memory bound problem. This ascertainment being established, many works have lately focused on minimizing memory accesses, either by enforcing and exploiting sparsity on weights or by using few bits for representing activations and weights, to be able to use ANNs inference in embedded devices. In this work, we detail an architecture dedicated to inference using ternary {−1, 0, 1} weights and activations. This architecture is configurable at design time to provide throughput vs. power trade-offs to choose from. It is also generic in the sense that it uses information drawn for the target technologies (memory geometries and cost, number of available cuts, etc.) to adapt at best to the FPGA resources. This allows to achieve up to 5.2k frames per second per Watt for classification on a VC709 board using approximately half of the resources of the FPGA.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121266946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}