{"title":"A Tri-State Weight Convolutional Neural Network for an FPGA: Applied to YOLOv2 Object Detector","authors":"Hiroki Nakahara, Masayuki Shimoda, Shimpei Sato","doi":"10.1109/FPT.2018.00058","DOIUrl":"https://doi.org/10.1109/FPT.2018.00058","url":null,"abstract":"Frame object detection, such as YOLO (You Only Look Once), is used in embedded vision systems such as robots, automobiles, security cameras, and drones. However, it requires high performance-per-power detection on an inexpensive device. In this paper, we propose a tri-state weight CNN, which generalizes low-precision and sparse (pruned) CNN weights. In the former part of the network, we set the weights to {-1,0,+1} as a ternary CNN, while in the latter part, we set them to {-w,0,+w} as a sparse-weight CNN. The proposed tri-state CNN is a kind of mixed-precision network, which is suitable for an object detector consisting of bounding box prediction (regression) and class estimation (classification). We apply an indirect memory access architecture to skip the zero parts and propose a weight-parallel 2D convolutional circuit. It can be efficiently applied to the AlexNet-based CNN, which has kernels of different sizes. We design an AlexNet-based YOLOv2 to reduce the number of layers toward low-latency computation. In the experiment, the proposed tri-state scheme reduces the memory size for weights by 92%. We implement the proposed tri-state weight YOLOv2 on the Avnet Inc. UltraZed-EG starter kit, which has the Xilinx Inc. Zynq UltraScale+ MPSoC ZU3EG. It achieved 61.70 frames per second (FPS), which exceeds the standard video frame rate (29.97 FPS). Compared with the ARM Cortex-A57, it was 268.2 times faster, and its performance-per-power efficiency was 313.51 times better. Also, compared with the NVidia Pascal embedded GPU, it was 4.0 times faster, and its performance-per-power efficiency was 11.35 times better.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133723854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Automated FPGA-Based Fault Injection Platform for Granularly-Pipelined Fault Tolerant CORDIC","authors":"Yu Xie, He Chen, Yizhuang Xie, Chuang-An Mao, Bingyi Li","doi":"10.1109/FPT.2018.00076","DOIUrl":"https://doi.org/10.1109/FPT.2018.00076","url":null,"abstract":"Increasing integration and complexity make VLSI circuits more sensitive to errors. Moreover, soft errors caused by Single Event Upsets (SEUs) have become a significant threat to modern electronic systems. Therefore, the demand for high reliability in modern electronic systems keeps increasing. Aiming at the reliability evaluation of fault-tolerant very large scale integrated circuits implemented on SRAM-based FPGAs, an automated fault injection platform via the Internal Configuration Access Port (ICAP) for rapid fault injection is presented in this paper. We adopt a granularly-pipelined fault-tolerant CORDIC processor as the Design Under Test (DUT), and a C++ script is deployed for the external fault injection control environment and to automate the fault injection procedure. The proposed method can perform large numbers of repeated fault injection tests and is suitable for any fault-tolerant design implemented on an SRAM-based FPGA.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123834858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FLiMS: Fast Lightweight Merge Sorter","authors":"Philippos Papaphilippou, Chris Brooks, W. Luk","doi":"10.1109/FPT.2018.00022","DOIUrl":"https://doi.org/10.1109/FPT.2018.00022","url":null,"abstract":"We have developed a highly efficient and simple parallel hardware design for merging two sorted lists residing in banked (or multi-ported) memory. The FPGA implementation uses half the hardware resources required to implement the current state-of-the-art architecture, while achieving better performance and half the latency for the same amount of parallelism. The challenges for merge operations on FPGAs have been the low clock frequency, due to the feedback datapath of the merger being the critical path for timing, and the high resource utilisation of recent attempts to eliminate the feedback datapath. Our solution uses a modified version of the bitonic merge block, as found in a bitonic sorter, repurposed to perform parallel merging of streaming data. As with the state of the art, it can be considered feedback-less, since it only nests one parallel comparison for any desired level of parallelism. This leads to designs with a high operating frequency, 1.3 times higher than the previous best on our test platform. Since the new design uses half the hardware resources, it allows more parallelism and leaves room for additional logic.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132955305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimisation of Convolution of Multiple Different Sized Filters in SKA Pulsar Search Engine","authors":"Haomiao Wang, B. Stappers, P. Thiagaraj, O. Sinnen","doi":"10.1109/FPT.2018.00073","DOIUrl":"https://doi.org/10.1109/FPT.2018.00073","url":null,"abstract":"Pulsar search is one of the main tasks of the Square Kilometre Array (SKA) central signal processor (CSP) sub-element. Because most pulsar characteristics are unknown, many pulsar search approaches are employed. The main compute-intensive component of the pulsar search modules is the matched filter group, which convolves the input signals with a group of filters. High-performance FPGA designs have been proposed that can process multiple large filters efficiently. But given that in many applications, including the pulsar search targeted here, the filters have different sizes, there is high potential for optimisation. This paper investigates the optimisation of a general matched filtering design for the SKA pulsar search engine. The influence of the number of filters and of the difference in their sizes is analysed. The general implementations in the time domain (TD) and frequency domain (FD) are optimised, employing the longest-processing-time (LPT) first rule to distribute filter templates across filter processing pipelines. The proposed design is employed to implement the matched filter groups in two SKA pulsar modules. The results show that the optimisation can provide up to 2.1x speedup in the TD and 1.2x speedup in the FD.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114303354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-Time Object Detection and Semantic Segmentation Hardware System with Deep Learning Networks","authors":"Shaoxia Fang, Lu Tian, Junbin Wang, Shuang Liang, Dongliang Xie, Zhongmin Chen, Lingzhi Sui, Qian Yu, Xiaoming Sun, Yi Shan, Yu Wang","doi":"10.1109/FPT.2018.00081","DOIUrl":"https://doi.org/10.1109/FPT.2018.00081","url":null,"abstract":"Advanced Driver Assistance Systems (ADAS) help the driver in the driving process by detecting objects, performing basic classification, implementing safety guards, and so on. Convolutional Neural Networks (CNNs) have proved essential to supporting ADAS. We designed an architecture named Aristotle to execute neural networks for both object detection and semantic segmentation on FPGAs. DNNDK (Deep Learning Development Toolkit), a full-stack software tool with tens of compilation optimization techniques, is proposed to improve energy efficiency and ease development. The Aristotle architecture is implemented on a Xilinx ZU9 FPGA, and two networks are deployed on it to perform object detection and semantic segmentation, respectively.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114830967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Real-Time Object Detection Accelerator with Compressed SSDLite on FPGA","authors":"Hongxiang Fan, Shuanglong Liu, Martin Ferianc, Ho-Cheung Ng, Zhiqiang Que, Shen Liu, Xinyu Niu, W. Luk","doi":"10.1109/FPT.2018.00014","DOIUrl":"https://doi.org/10.1109/FPT.2018.00014","url":null,"abstract":"Convolutional neural network (CNN)-based object detection has been widely employed in various applications such as autonomous driving and intelligent video surveillance. However, the computational complexity of conventional convolution hinders its application in embedded systems. Recently, a mobile-friendly CNN model, SSDLite-MobileNetV2 (SSDLiteM2), has been proposed for object detection. This model includes a novel layer called the bottleneck residual block (BRB). Although SSDLiteM2 contains far fewer parameters and computations than conventional CNN models, its performance on embedded devices still cannot meet the requirements of real-time processing. This paper proposes a novel FPGA-based architecture for SSDLiteM2 in combination with hardware optimizations including fused BRB, processing element (PE) sharing, and load-balanced channel pruning. Moreover, a novel quantization scheme called partial quantization has been developed, which partially quantizes SSDLiteM2 to 8 bits with only 1.8% accuracy loss. Experiments show that the proposed design on a Xilinx ZC706 device can achieve up to 65 frames per second with 20.3 mean average precision on the COCO dataset.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124474270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Confidentiality in Virtualized FPGAs","authors":"S. Yazdanshenas, Vaughn Betz","doi":"10.1109/FPT.2018.00048","DOIUrl":"https://doi.org/10.1109/FPT.2018.00048","url":null,"abstract":"FPGAs are being deployed in modern datacenters to provide users with specialized accelerators that offer superior compute capability, increased energy efficiency, lower latency, and more programming flexibility than CPUs. However, FPGAs are not utilized as efficiently in datacenters: unlike CPUs, FPGAs in datacenters are currently not shared between users due to potential security risks. The greater flexibility that comes with FPGAs also gives more capabilities to malicious users. Several recent studies have demonstrated examples of FPGA user applications capable of remotely sniffing data from other applications running on the same FPGA. In this work, we look at various ways to ameliorate these threats by encrypting/decrypting the user application's data under different trust levels for current virtualized FPGAs. We also discuss the role of the interconnect and the potential for more efficient security features that can be implemented together with the interconnect if the FPGA uses a hard network on chip.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131872728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dither NN: An Accurate Neural Network with Dithering for Low Bit-Precision Hardware","authors":"Kota Ando, Kodai Ueyoshi, Yuka Oba, Kazutoshi Hirose, Ryota Uematsu, Takumi Kudo, M. Ikebe, T. Asai, Shinya Takamaeda-Yamazaki, M. Motomura","doi":"10.1109/FPT.2018.00013","DOIUrl":"https://doi.org/10.1109/FPT.2018.00013","url":null,"abstract":"Energy-constrained neural network processing is in high demand for various mobile applications. Binary neural networks aggressively enhance computational efficiency but, in contrast, suffer from degraded accuracy due to their extreme approximation. We propose a novel, accurate neural network model based on binarization and \"dithering\" that distributes the quantization error to neighboring pixels. The quantization errors of binarization are distributed across the plane, so that a pixel in the multi-level source expression is more accurately represented in the resulting binarized plane by multiple pixels. We designed a low-overhead binary-based hardware architecture for the proposed model. The evaluation results show that this method can be realized with a few additional lightweight hardware components.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128928042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA Implementation of Robust Matting","authors":"Takuya Yamazaki, T. Maruyama","doi":"10.1109/FPT.2018.00047","DOIUrl":"https://doi.org/10.1109/FPT.2018.00047","url":null,"abstract":"Matting is the process of extracting the foreground from the background in an image. It is one of the key techniques in image editing, and many algorithms have been proposed. Robust Matting is one of the most powerful matting algorithms. In Robust Matting, first, a set of pixels in the foreground and background is sampled; then, for each pixel in the image, its category is determined by comparing it with a number of pairs of foreground and background samples. Its computational complexity is very high because of the large number of multiply and square operations. In this paper, we propose an FPGA implementation for real-time processing of HD images. In our approach, the image is divided into small blocks, and the same pairs of foreground and background pixels are used for all pixels in the same block in order to reduce the number of operations. This approach makes it possible to execute the multiply operations simply by looking up tables that are dynamically updated block by block.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131084578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FCLNN: A Flexible Framework for Fast CNN Prototyping on FPGA with OpenCL and Caffe","authors":"Xianchao Xu, Brian Liu","doi":"10.1109/FPT.2018.00043","DOIUrl":"https://doi.org/10.1109/FPT.2018.00043","url":null,"abstract":"CNN algorithms are still evolving rapidly, while traditional RTL-level programming on FPGAs is relatively slow and requires great effort and expertise. In this paper, we propose a flexible HW/SW co-design framework for both fast and high-throughput CNN prototyping with the commercial high-level OpenCL language and the standard open-source deep learning framework Caffe. We build a parameterizable stream-architected convolution engine and extend it to support any input size and filter depth. For an iterative development process, we provide both layer-based and subgraph-based execution schedules. For competitive performance, both on-chip and off-chip communication are optimized. Using our framework with an Intel Arria 10 GX1150 FPGA, we achieve 69.2 fps and 18.6 fps on the official YOLOv2-tiny-voc and YOLOv2-voc, respectively. To the best of our knowledge, this is the first work to accelerate the state-of-the-art YOLOv2 with both real-time performance and < 1% accuracy drop on an FPGA.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121710548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}