2019 International Conference on Field-Programmable Technology (ICFPT): Latest Publications

Static Block Floating-Point Quantization for Convolutional Neural Networks on FPGA

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00012

Hongxiang Fan, Gang Wang, Martin Ferianc, Xinyu Niu, W. Luk

Abstract: Convolutional neural networks (CNNs) have been widely applied in various computer vision and speech processing applications. However, the algorithmic complexity of CNNs hinders their deployment in embedded systems with limited memory and computational resources. This paper proposes static block floating-point (BFP) quantization, an effective approach involving Kullback-Leibler divergence, to determine the static shared exponents. Without need for retraining, the proposed approach is able to quantize CNNs to 8 bits with negligible accuracy loss. An FPGA-based hardware design with static BFP quantization is also proposed. Compared with 8-bit integer linear quantization, our experiments show that the hardware kernel based on static BFP quantization can achieve over 50% reduction in logic resources on an FPGA. Based on static BFP quantization, a tool implemented in the PyTorch framework is developed, which can automatically generate optimised configuration according to user requirements for given CNN models, where the entire optimization process takes only a few minutes on an Intel Xeon Silver 4110 CPU.
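
As a rough illustration of the block floating-point idea described in this abstract, the Python sketch below quantizes a block of values to signed 8-bit mantissas that share a single exponent. It is a minimal sketch under stated assumptions, not the authors' method: the shared exponent here is simply derived from the largest magnitude in the block, whereas the paper determines static exponents using Kullback-Leibler divergence. The function names `bfp_quantize` and `bfp_dequantize` are hypothetical.

```python
import math


def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of floats to signed mantissas with one shared exponent.

    Assumption: the exponent is chosen from the block's maximum magnitude.
    The paper instead picks static exponents via KL divergence.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0] * len(block), 0
    # Smallest integer e such that max_abs < 2**e.
    shared_exp = math.floor(math.log2(max_abs)) + 1
    # One bit is used for the sign, the rest for the magnitude.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    lo = -(2 ** (mantissa_bits - 1))
    hi = 2 ** (mantissa_bits - 1) - 1
    # Round to the nearest mantissa and clip to the representable range.
    mantissas = [min(hi, max(lo, round(x / scale))) for x in block]
    return mantissas, shared_exp


def bfp_dequantize(mantissas, shared_exp, mantissa_bits=8):
    """Reconstruct approximate floats from mantissas and the shared exponent."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]
```

For example, `bfp_quantize([0.5, -0.25, 0.125, 0.75])` returns the mantissas `[64, -32, 16, 96]` with shared exponent `0`, and `bfp_dequantize` recovers the inputs exactly because each value is a multiple of 2^-7. Storing one exponent per block rather than per value is what lets the multiply-accumulate hardware operate on narrow integers.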

Citations: 7

ZytleBot: FPGA Integrated Development Platform for ROS Based Autonomous Mobile Robot

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00089

Yasuhiro Nitta, Sou Tamura, Hidetoshi Yugen, Hideki Takase

Abstract: The FPT2019 FPGA Design Competition is a competition aimed at recommending innovations in utilizing FPGAs to realize level 5 autonomous driving vehicles. We have developed "ZytleBot", an ROS based robot utilizing FPGA for the contest. ZytleBot calculates all processing necessary for autonomous driving in realtime within a programmable SoC. Therefore, it is possible to go around the course imitating an actual road, detect signals and obstacles and take appropriate behavior without any external operation. Robot development requires a wide range of knowledge and technology, but we have proceeded robot development efficiently by using ROS, a robot development middleware, and TurtleBot3, a robot development platform. As an application of FPGA, road surface image preprocessing and traffic signal detector using machine learning is implemented in the FPGA. The traffic signal detector uses HOG features and SVM classifiers, which runs over 270 times faster than running on the processor. We also provide ZytleBot as a platform for efficient development of FPGA integrated ROS robots.

Citations: 0

Secure Internal Communication of a Trustzone-Enabled Heterogeneous Soc Lightweight Encryption

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00037

E. M. Benhani, C. M. López, L. Bossuet

Citations: 3

OpenCL Implementation of Cannon’s Matrix Multiplication Algorithm on Intel Stratix 10 FPGAs

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00020

Paolo Gorlani, Tobias Kenter, Christian Plessl

Abstract: Stratix 10 FPGA cards have a good potential for the acceleration of HPC workloads since the Stratix 10 product line introduces devices with a large number of DSP and memory blocks. The high level synthesis of OpenCL codes can play a fundamental role for FPGAs in HPC, because it allows to implement different designs with lower development effort compared to hand optimized HDL. However, Stratix 10 cards are still hard to fully exploit using the Intel FPGA SDK for OpenCL. The implementation of designs with thousands of concurrent arithmetic operations often suffers from place and route problems that limit the maximum frequency or entirely prevent a successful synthesis. In order to overcome these issues for the implementation of the matrix multiplication, we formulate Cannon's matrix multiplication algorithm with regard to its efficient synthesis within the FPGA logic. We obtain a two-level block algorithm, where the lower level sub-matrices are multiplied using our Cannon's algorithm implementation. Following this design approach with multiple compute units, we are able to get maximum frequencies close to and above 300 MHz with high utilization of DSP and memory blocks. This allows for performance results above 1 TeraFLOPS.
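
To make the data movement in Cannon's algorithm concrete, the Python sketch below simulates it on an n × n grid in which each grid position holds a single matrix element. This is only an illustrative software model, assuming square integer matrices; it is not the paper's two-level blocked OpenCL design, and the function name `cannon_matmul` is hypothetical.

```python
def cannon_matmul(A, B):
    """Multiply two n x n matrices using Cannon's skew-and-shift scheme.

    Each grid position (i, j) accumulates C[i][j] from the pair of
    operands it currently holds, then passes A left and B up.
    """
    n = len(A)
    # Initial skew: rotate row i of A left by i, column j of B up by j.
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Local multiply-accumulate at every grid position.
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
        # Rotate every row of A left by one and every column of B up by one.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C
```

Calling `cannon_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`. Because every position only exchanges data with its immediate neighbours, the scheme maps naturally onto a regular array of processing elements, which is consistent with the abstract's emphasis on efficient synthesis.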

Citations: 12

Implementation of Distributed Processing Using a PC-FPGA Hybrid System

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00074

Keisuke Takano, Tetsuya Oda, Ryo Ozaki, A. Uejima, M. Kohata

Citations: 2

Extending the Lifetime of Coarse-Grained Runtime Reconfigurable FPGAs by Balancing Processing Element Usage

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00050

Bo Hu, M. Shihab, Y. Makris, Benjamin Carrión Schäfer, C. Sechen

Citations: 0

FPNet: Customized Convolutional Neural Network for FPGA Platforms

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00077

Y. Yang, Chao Wang, Lei Gong, Xuehai Zhou

Citations: 8

Optimizing FPGA-Based Streaming Applications for Throughput Using Pipelining

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00065

Ali Asghar, R. V. Loo, Timon Kruiper, Daniel Ziener

Citations: 2

RNA: Reconfigurable LSTM Accelerator with Near Data Approximate Processing

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00055

Yu Gong, Bo Liu, Wei-qi Ge, Longxing Shi

Abstract: Near Data Processing (NDP) techniques are introduced into deep learning accelerators as they can greatly relieve the pressure on memory bandwidth. Besides, approximate computing is also adopted in accelerating neural networks for the network fault-tolerance to reduce energy consumption. In this paper, an NDP accelerator with approximate computing features for LSTM is proposed to explore the data parallelism with reconfigurable features. Firstly, a hybrid-grained network partitioning model with scheduling strategy of LSTM is put forward to achieve high processing parallelism. Secondly, the approximate computing units are designed for LSTM with adaptive precision. Then the heterogeneous architecture, RNA, with reconfigurable computing arrays and approximate NDP units is proposed and implemented regarding the configuration code. The gates and cells in LSTM are modeled into fine-grained operations, organized in coarse-grained tasks, and then mapped onto RNA. In addition, approximate computing units are integrated into the NDP units with the adaptive precision, which is also controlled by the configuration codes. The proposed RNA architecture achieved 544 GOPS/W energy efficiency while processing LSTM, and further can be extended for larger and more complex recurrent neural networks. Comparing with the state-of-the-art accelerator for LSTM, it is 2.14 times better in efficiency.
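
For readers unfamiliar with the operations that an LSTM accelerator has to schedule, the Python sketch below spells out one time step of a standard LSTM cell as matrix-vector products, element-wise additions, and sigmoid/tanh activations. It is a plain exact-arithmetic reference, assuming the common four-gate formulation; it does not model the paper's approximate computing units, partitioning, or mapping, and the names `lstm_step`, `matvec`, and `sigmoid` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W, U, b are dicts keyed by gate name: 'i' (input), 'f' (forget),
    'o' (output), 'g' (candidate cell). W[k] is hidden x input,
    U[k] is hidden x hidden, b[k] has length hidden.
    """
    pre = {}
    for k in "ifog":
        wx = matvec(W[k], x)        # input contribution
        uh = matvec(U[k], h_prev)   # recurrent contribution
        pre[k] = [p + q + r for p, q, r in zip(wx, uh, b[k])]
    i = [sigmoid(v) for v in pre["i"]]
    f = [sigmoid(v) for v in pre["f"]]
    o = [sigmoid(v) for v in pre["o"]]
    g = [math.tanh(v) for v in pre["g"]]
    # New cell state: keep part of the old state, add gated candidate.
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed cell state.
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c
```

The eight matrix-vector products dominate the arithmetic and the memory traffic, which is why relieving memory bandwidth pressure is a natural target for LSTM accelerators.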

Citations: 0

SpWMM: A High-Performance Sparse-Winograd Matrix-Matrix Multiplication Accelerator for CNNs

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00041

Di Wu, Wei Cao, Lingli Wang

Abstract: In recent years, many CNN accelerators are proposed to exploit the sparsity of the networks to enjoy the benefits of both computation and memory reduction. However, these accelerators either cannot exploit the sparsity of both activations and weights, or cannot achieve stable performance with a static scheduling strategy, which is vulnerable to the sparsity distribution. This paper proposes a dynamic scheduling strategy and a balanced compressed sparse row (BCSR) format to efficiently address these two issues. A set-associate structure is presented to tradeoff the load balance and logic overhead. We propose SpWMM to accelerate the CNN inference, which is the first work to implement both sparse Winograd convolution and sparse fully-connected (FC) layers. On contemporary neural networks, this work achieves: (1) 2.6Top/s for Winograd convolution and 525Gop/s for 1×1 convolution and FC layers in the 4-way association design on Xilinx ZC706 platform, (2) 6.5 Top/s for Winograd convolution and 1.2Top/s for 1×1 convolution and FC layers in the 16-way association design on Xilinx VCU1525 platform. Compared with the state-of-the-art works on the same platform, the 4-way design achieves 2.0× speedup.
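
As background for the Winograd convolution mentioned in this abstract, the Python sketch below implements the smallest one-dimensional case, F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. It is a dense, textbook-style illustration only; it does not reproduce the paper's sparse formulation, BCSR format, or scheduling, and the function names `winograd_f23` and `direct_f23` are hypothetical.

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter over a 4-sample tile.

    d: four input samples, g: three filter taps.
    Uses four multiplications between transformed inputs and filter terms.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter-side terms (these can be precomputed once per filter).
    u1 = g0
    u2 = (g0 + g1 + g2) / 2
    u3 = (g0 - g1 + g2) / 2
    u4 = g2
    # Four element-wise products with the transformed input.
    m1 = (d0 - d2) * u1
    m2 = (d1 + d2) * u2
    m3 = (d2 - d1) * u3
    m4 = (d1 - d3) * u4
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference sliding-window computation using six multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]
```

With `d = [1, 2, 3, 4]` and `g = [2, 4, 6]`, both functions give the values 28 and 40. Zeros in the transformed filter terms or transformed inputs make the corresponding products unnecessary, which is the kind of sparsity a sparse Winograd design can exploit.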

Citations: 3