{"title":"A Tri-State Weight Convolutional Neural Network for an FPGA: Applied to YOLOv2 Object Detector","authors":"Hiroki Nakahara, Masayuki Shimoda, Shimpei Sato","doi":"10.1109/FPT.2018.00058","DOIUrl":"https://doi.org/10.1109/FPT.2018.00058","url":null,"abstract":"Frame object detection, such as YOLO (You Only Look Once), is used in embedded vision systems such as robots, automobiles, security cameras, and drones. However, it requires high performance-per-power detection on an inexpensive device. In this paper, we propose a tri-state weight CNN, which generalizes low-precision and sparse (pruned) CNN weights. In the former part of the network, we set the weights to {-1,0,+1} as a ternary CNN, while in the latter part, we set them to {-w,0,+w} as a sparse-weight CNN. The proposed tri-state CNN is a kind of mixed-precision network, which is suitable for an object detector consisting of bounding box prediction (regression) and class estimation (classification). We apply an indirect memory access architecture to skip the zero parts and propose a weight-parallel 2D convolutional circuit. It can be efficiently applied to the AlexNet-based CNN, which has kernels of different sizes. We design an AlexNet-based YOLOv2 to reduce the number of layers toward low-latency computation. In the experiment, the proposed tri-state scheme reduces the memory size for weights by 92%. We implement the proposed tri-state weight YOLOv2 on the Avnet Inc. UltraZed-EG starter kit, which has the Xilinx Inc. Zynq UltraScale+ MPSoC ZU3EG. It achieved 61.70 frames per second (FPS), which exceeds the standard video frame rate (29.97 FPS). Compared with the ARM Cortex-A57, it was 268.2 times faster, and its performance-per-power efficiency was 313.51 times better. Also, compared with the NVidia Pascal embedded GPU, it was 4.0 times faster, and its performance-per-power efficiency was 11.35 times better.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133723854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Automated FPGA-Based Fault Injection Platform for Granularly-Pipelined Fault Tolerant CORDIC","authors":"Yu Xie, He Chen, Yizhuang Xie, Chuang-An Mao, Bingyi Li","doi":"10.1109/FPT.2018.00076","DOIUrl":"https://doi.org/10.1109/FPT.2018.00076","url":null,"abstract":"Increasing integration and complexity make VLSI circuits more sensitive to errors. Moreover, soft errors caused by Single Event Upsets (SEUs) have become a significant threat to modern electronic systems. Therefore, the demand for high reliability in modern electronic systems keeps increasing. Aiming at the reliability evaluation of fault-tolerant very large scale integrated circuits implemented on SRAM-based FPGAs, an automated fault injection platform via the Internal Configuration Access Port (ICAP) for rapid fault injection is presented in this paper. We adopt a granularly-pipelined fault-tolerant CORDIC processor as the Design Under Test (DUT), and a C++ script is deployed for the external fault injection control environment and to automate the fault injection procedure. The proposed method can perform large numbers of repeated fault injection tests and is suitable for any fault-tolerant design implemented on an SRAM-based FPGA.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123834858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FLiMS: Fast Lightweight Merge Sorter","authors":"Philippos Papaphilippou, Chris Brooks, W. Luk","doi":"10.1109/FPT.2018.00022","DOIUrl":"https://doi.org/10.1109/FPT.2018.00022","url":null,"abstract":"We have developed a highly efficient and simple parallel hardware design for merging two sorted lists residing in banked (or multi-ported) memory. The FPGA implementation uses half the hardware resources required to implement the current state-of-the-art architecture, while achieving better performance and half the latency for the same amount of parallelism. The challenges for merge operations on FPGAs have been the low clock frequency, due to the feedback datapath of the merger being the critical path for timing, and the high resource utilisation of recent attempts to eliminate the feedback datapath. Our solution uses a modified version of the bitonic merge block, as found in a bitonic sorter, repurposed to perform parallel merging of streaming data. As with the state of the art, it can be considered feedback-less, since it only nests one parallel comparison for any desired level of parallelism. This leads to designs with a high operating frequency, 1.3 times higher than the previous best on our test platform. Since the new design uses half the hardware resources, it allows more parallelism and leaves room for additional logic.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132955305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimisation of Convolution of Multiple Different Sized Filters in SKA Pulsar Search Engine","authors":"Haomiao Wang, B. Stappers, P. Thiagaraj, O. Sinnen","doi":"10.1109/FPT.2018.00073","DOIUrl":"https://doi.org/10.1109/FPT.2018.00073","url":null,"abstract":"Pulsar search is one of the main tasks of the Square Kilometre Array (SKA) central signal processor (CSP) sub-element. Because most pulsar characteristics are unknown, many pulsar search approaches are employed. The main compute-intensive component of the pulsar search modules is the matched filter group, which convolves the input signals with a group of filters. High-performance FPGA designs have been proposed that can process multiple large filters efficiently. But given that in many applications, including the pulsar search targeted here, the filters have different sizes, there is high potential for optimisation. This paper investigates the optimisation of a general matched filtering design for the SKA pulsar search engine. The influence of the number of filters and of the difference in their sizes is analysed. The general implementations in the time domain (TD) and frequency domain (FD) are optimised, employing the longest-processing-time (LPT) first rule to distribute filter templates across filter processing pipelines. The proposed design is employed to implement the matched filter groups in two SKA pulsar modules. The results show that the optimisation can provide up to 2.1x speedup in the TD and 1.2x speedup in the FD.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114303354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-Time Object Detection and Semantic Segmentation Hardware System with Deep Learning Networks","authors":"Shaoxia Fang, Lu Tian, Junbin Wang, Shuang Liang, Dongliang Xie, Zhongmin Chen, Lingzhi Sui, Qian Yu, Xiaoming Sun, Yi Shan, Yu Wang","doi":"10.1109/FPT.2018.00081","DOIUrl":"https://doi.org/10.1109/FPT.2018.00081","url":null,"abstract":"Advanced Driver Assistance Systems (ADAS) help the driver in the driving process by detecting objects, performing basic classification, implementing safety guards, and so on. Convolutional Neural Networks (CNNs) have proved essential to supporting ADAS. We designed an architecture named Aristotle to execute neural networks for both object detection and semantic segmentation on FPGAs. DNNDK (Deep Learning Development Toolkit), a full-stack software tool with tens of compilation optimization techniques, is proposed to improve energy efficiency and ease development. The Aristotle architecture is implemented on a Xilinx ZU9 FPGA, and two networks are deployed on it to perform object detection and semantic segmentation, respectively.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114830967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Real-Time Object Detection Accelerator with Compressed SSDLite on FPGA","authors":"Hongxiang Fan, Shuanglong Liu, Martin Ferianc, Ho-Cheung Ng, Zhiqiang Que, Shen Liu, Xinyu Niu, W. Luk","doi":"10.1109/FPT.2018.00014","DOIUrl":"https://doi.org/10.1109/FPT.2018.00014","url":null,"abstract":"Convolutional neural network (CNN)-based object detection has been widely employed in various applications such as autonomous driving and intelligent video surveillance. However, the computational complexity of conventional convolution hinders its application in embedded systems. Recently, a mobile-friendly CNN model, SSDLite-MobileNetV2 (SSDLiteM2), has been proposed for object detection. This model includes a novel layer called the bottleneck residual block (BRB). Although SSDLiteM2 contains far fewer parameters and computations than conventional CNN models, its performance on embedded devices still cannot meet the requirements of real-time processing. This paper proposes a novel FPGA-based architecture for SSDLiteM2 in combination with hardware optimizations including fused BRB, processing element (PE) sharing, and load-balanced channel pruning. Moreover, a novel quantization scheme called partial quantization has been developed, which partially quantizes SSDLiteM2 to 8 bits with only 1.8% accuracy loss. Experiments show that the proposed design on a Xilinx ZC706 device can achieve up to 65 frames per second with 20.3 mean average precision on the COCO dataset.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124474270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Confidentiality in Virtualized FPGAs","authors":"S. Yazdanshenas, Vaughn Betz","doi":"10.1109/FPT.2018.00048","DOIUrl":"https://doi.org/10.1109/FPT.2018.00048","url":null,"abstract":"FPGAs are being deployed in modern datacenters to provide users with specialized accelerators that offer superior compute capability, increased energy efficiency, lower latency, and more programming flexibility than CPUs. However, FPGAs are not utilized as efficiently in datacenters: unlike CPUs, FPGAs in datacenters are currently not shared between users due to potential security risks. The greater flexibility that comes with FPGAs also gives more capabilities to malicious users. Several recent studies have demonstrated examples of FPGA user applications capable of remotely sniffing data from other applications running on the same FPGA. In this work, we look at various ways to ameliorate these threats by encrypting/decrypting the user application's data under different trust levels for current virtualized FPGAs. We also discuss the role of the interconnect and the potential for more efficient security features that can be implemented together with the interconnect if the FPGA uses a hard network on chip.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131872728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dither NN: An Accurate Neural Network with Dithering for Low Bit-Precision Hardware","authors":"Kota Ando, Kodai Ueyoshi, Yuka Oba, Kazutoshi Hirose, Ryota Uematsu, Takumi Kudo, M. Ikebe, T. Asai, Shinya Takamaeda-Yamazaki, M. Motomura","doi":"10.1109/FPT.2018.00013","DOIUrl":"https://doi.org/10.1109/FPT.2018.00013","url":null,"abstract":"Energy-constrained neural network processing is in high demand for various mobile applications. Binary neural networks aggressively enhance computational efficiency but, in contrast, suffer from degraded accuracy due to their extreme approximation. We propose a novel, accurate neural network model based on binarization and \"dithering\" that distributes the quantization error to neighboring pixels. The quantization errors of binarization are distributed across the plane, so that a pixel in the multi-level source expression is more accurately represented in the resulting binarized plane by multiple pixels. We designed a low-overhead binary-based hardware architecture for the proposed model. The evaluation results show that this method can be realized with a few additional lightweight hardware components.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128928042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA Implementation of Robust Matting","authors":"Takuya Yamazaki, T. Maruyama","doi":"10.1109/FPT.2018.00047","DOIUrl":"https://doi.org/10.1109/FPT.2018.00047","url":null,"abstract":"Matting is the process of extracting the foreground from the background in an image. It is one of the key techniques in image editing, and many algorithms have been proposed. Robust Matting is one of the most powerful matting algorithms. In Robust Matting, first, a set of pixels in the foreground and background is sampled; then, for each pixel in the image, its category is determined by comparing it with a number of pairs of foreground and background samples. Its computational complexity is very high because of the large number of multiply and square operations. In this paper, we propose an FPGA implementation for real-time processing of HD images. In our approach, the image is divided into small blocks, and the same pairs of foreground and background pixels are used for all pixels in the same block in order to reduce the number of operations. This approach makes it possible to execute the multiply operations simply by looking up tables that are dynamically updated block by block.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131084578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FCLNN: A Flexible Framework for Fast CNN Prototyping on FPGA with OpenCL and Caffe","authors":"Xianchao Xu, Brian Liu","doi":"10.1109/FPT.2018.00043","DOIUrl":"https://doi.org/10.1109/FPT.2018.00043","url":null,"abstract":"CNN algorithms are still evolving rapidly, while traditional RTL-level programming on FPGAs is relatively slow and requires great effort and expertise. In this paper, we propose a flexible HW/SW co-design framework for both fast and high-throughput CNN prototyping with the commercial high-level OpenCL language and the standard open-source deep learning framework Caffe. We build a parameterizable stream-architected convolution engine and extend it to support any input size and filter depth. For an iterative development process, we provide both layer-based and subgraph-based execution schedules. For competitive performance, both on-chip and off-chip communication are optimized. Using our framework with an Intel Arria 10 GX1150 FPGA, we achieve 69.2 fps and 18.6 fps on the official YOLOv2-tiny-voc and YOLOv2-voc, respectively. To the best of our knowledge, this is the first work to accelerate the state-of-the-art YOLOv2 with both real-time performance and < 1% accuracy drop on an FPGA.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121710548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}