{"title":"A High Energy-Efficiency FPGA-Based LSTM Accelerator Architecture Design by Structured Pruning and Normalized Linear Quantization","authors":"Yong Zheng, Haigang Yang, Zhihong Huang, Tianli Li, Yiping Jia","doi":"10.1109/ICFPT47387.2019.00045","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00045","url":null,"abstract":"LSTM (Long Short-Term Memory) is an artificial recurrent neural network (RNN) architecture that has been successfully applied to areas where sequences of data must be processed, such as Natural Language Processing (NLP) and speech recognition. In this work, we explore a way to minimize an FPGA-based LSTM inference design for high performance and energy efficiency. First, the model is pruned with permuted block diagonal mask matrices to create hardware-friendly structured sparsity. To further compress the model, we quantize the weights and activations following a normalized linear quantization approach. As a result, the computational activity of the network is significantly reduced with a negligible loss in accuracy. Then a hardware architecture is devised to fully exploit the benefits of the regular sparse structure. Implemented on an Arria 10 (10AX115U4F45I3SG) FPGA running at 150 MHz, our accelerator demonstrates a peak performance of 2.22 TOPS at a power dissipation of 1.679 Watts. 
In comparison to the other FPGA-based LSTM accelerator designs previously reported, our approach achieves a 1.17-2.16x speedup in processing.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122511652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OBFS: OpenCL Based BFS Optimizations on Software Programmable FPGAs","authors":"Cheng Liu, Xinyu Chen, Bingsheng He, Xiaofei Liao, Ying Wang, Lei Zhang","doi":"10.1109/ICFPT47387.2019.00056","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00056","url":null,"abstract":"Breadth First Search (BFS) is a key building block of graph processing and there have been considerable efforts devoted to accelerating BFS on FPGAs for both performance and energy efficiency. Prior work typically built the BFS accelerator through handcrafted circuit design using a hardware description language (HDL). Despite the relatively good performance, the HDL based design leads to extremely low design productivity, and incurs high portability and maintenance costs. While high level synthesis (HLS) tools make it convenient to create a functionally correct BFS accelerator, the performance can be much lower than that of the handcrafted HDL design. To obtain both near-handcrafted performance and better software-like features such as portability and maintainability, we propose OBFS, an OpenCL based BFS accelerator on software programmable FPGAs. With the observation that OpenCL based FPGA design is rather inefficient on irregular memory accesses, we propose approaches including data alignment, graph reordering and batching to ensure coalesced memory accesses. In addition, we take advantage of the on-chip buffer to mitigate the inefficient random DDR accesses. Finally, we shift the random level update in BFS out of the main processing pipeline and overlap it with the following BFS processing task. According to the experiments, OBFS achieves 9.5X and 5.5X performance speedup on average compared to a vertex-centric implementation and an edge-centric implementation respectively on Intel Harp-v2. 
When compared to prior handcrafted designs, it achieves comparable or even better performance.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131912040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
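For readers unfamiliar with the baseline that the OBFS abstract refers to, the following is a minimal software sketch of a level-synchronous, vertex-centric BFS. It is not the OBFS design: the function name `bfs_levels`, the edge-list input, and the dictionary-based level table are all assumptions made for illustration, and none of the paper's optimizations (data alignment, graph reordering, batching, deferred level update) are modeled.

```python
from collections import defaultdict


def bfs_levels(edges, source):
    """Return {vertex: BFS level} for every vertex reachable from source.

    Hypothetical helper for illustration only; edges is a list of
    directed (u, v) pairs.
    """
    # Build an adjacency list from the edge list.
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)

    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                # The level write below is the irregular (random) memory
                # update that the abstract describes moving out of the
                # main pipeline.
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```

Calling `bfs_levels([(0, 1), (0, 2), (1, 3)], 0)` visits the graph one frontier at a time, which is the access pattern whose irregularity the abstract identifies as the bottleneck.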
{"title":"Time-SWAD: A Dataflow Engine for Time-Based Single Window Stream Aggregation","authors":"Prajith Ramakrishnan Geethakumari, Vincenzo Gulisano, P. Trancoso, I. Sourdis","doi":"10.1109/ICFPT47387.2019.00017","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00017","url":null,"abstract":"High throughput and low latency streaming aggregation is essential for many applications that analyze massive volumes of data in real-time. Incoming data need to be stored in a single sliding window before processing, in cases where incremental aggregations are wasteful or not possible at all; this puts tremendous pressure on the memory bandwidth. In addition, particular problems call for time-based windows, defined by a time interval, where the amount of data per window may vary and which are consequently more challenging to handle. This paper describes Time-SWAD, the first accelerator for time-based single-window stream aggregation. Time-SWAD is a dataflow engine (DFE), implemented on a Maxeler machine, offering high processing throughput, up to 150 Mtuples/sec, similar to related GPU systems, which however do not support both time-based and single windows. 
It uses a direct feed of incoming data from the network and has direct access to off-chip DRAM, enabling ultra-low processing latency of 1-10 µsec, at least 4 orders of magnitude lower than software-based solutions.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114264436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
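To make the notion of a time-based single sliding window concrete, here is a minimal software sketch, not the Time-SWAD dataflow engine. It assumes tuples arrive as `(timestamp, value)` pairs in timestamp order, uses the median as an example of an aggregate that cannot be maintained incrementally, and treats the window as covering the half-open interval `(ts - window, ts]`. The function name `time_window_medians` and all parameters are hypothetical.

```python
from collections import deque
from statistics import median


def time_window_medians(stream, window):
    """For each arriving tuple, emit (timestamp, median over the window).

    Hypothetical helper for illustration only; stream yields
    (timestamp, value) pairs in non-decreasing timestamp order.
    """
    buf = deque()  # the single window: all tuples currently in range
    out = []
    for ts, value in stream:
        buf.append((ts, value))
        # Evict tuples that have fallen out of the time interval. The
        # number of tuples left in the window varies with the arrival
        # rate, which is what makes time-based windows harder to handle.
        while buf and buf[0][0] <= ts - window:
            buf.popleft()
        # Non-incremental aggregate: recomputed over the whole window.
        out.append((ts, median(v for _, v in buf)))
    return out
```

Because every output re-reads the entire buffer, memory traffic grows with the window population, which mirrors the memory-bandwidth pressure the abstract describes.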
{"title":"Image Processing and Vehicles – Using FPGA to Reduce Latency of Time Critical Tasks","authors":"A. Yeo, Damon Hill, Anzhen Huang, Xueao Liu, G. Dong, D. Bailey","doi":"10.1109/ICFPT47387.2019.00097","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00097","url":null,"abstract":"When automating complex vision-controlled objects such as self-driving cars, large amounts of data need to be processed accurately, quickly and reliably at video frame rates. In this paper we propose the use of an Intel Cyclone V FPGA to process image data in a parallel form, building a safe real-time system. We discuss the hardware used to build the physical prototype and the control algorithms to build the control architecture that controls the prototype.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126279903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autonomous Vehicle Driving Using the Stream-Based Real-Time Hardware Line Detector","authors":"Taito Manabe, Naofumi Yoshinaga, Yuta Imamura, Taichi Saikai, Koki Fujita, Masatomo Matsuda, Kotoko Miyata, Tatsuma Mori, Yuichiro Shibata, H. Egawa, Yuichi Kawamata, Tomohiro Kida, Ryouhei Tsugami, Ryohei Kakizaki, Taichi Katayama, Koki Tomonaga, Shota Fukui","doi":"10.1109/ICFPT47387.2019.00093","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00093","url":null,"abstract":"To achieve level 5 autonomous driving, which enables a totally driver-less vehicle, image recognition ability that is close to the human level is essential, since most information required for safe driving is currently provided as visual information, such as traffic lanes and signs. Although image recognition includes various technologies, we focus on line detection in this paper, which can be used especially for lane keeping. To achieve real-time line detection with lower latency and power consumption, we prefer a stream-based hardware implementation using an FPGA. A line segment detector (LSD) is an algorithm for line detection based on intensity gradient, and is better than the well-known Hough transform in terms of processing speed and accuracy. However, implementing the LSD on FPGAs in a streaming manner is difficult due to its iterative approach. Therefore, we propose a simple and stream-friendly line detection algorithm based on the LSD. Evaluation results reveal that the implemented system is compact while maintaining 60 fps throughput for VGA moving images. 
We also introduce other components to be used to build an autonomous driving system in this paper.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128514119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient OS Hardware Accelerators Preemption Management in FPGA","authors":"Ye Tian, Jean-Christophe Prévotet, F. Nouvel","doi":"10.1109/ICFPT47387.2019.00069","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00069","url":null,"abstract":"The management of reconfiguration in FPGAs constitutes a hot topic in a lot of domains. In such devices, a reconfigurable fabric is generally combined with a processor to guarantee high computing performance with a limited amount of hardware resources. Most of these devices generally feature an operating system (OS) that interacts with hardware Intellectual Property (IP) resources. Software tasks (managed by the OS) may then access hardware resources concurrently and dedicated mechanisms have to be provided to manage resource sharing efficiently. The problem is even bigger if hardware resources are localized in a reconfigurable area. In this paper, we deal with the problem of sharing hardware resources in a reconfigurable device. We propose a preemption mechanism for hardware resources that may reduce the reconfiguration time overhead to be compatible with the timing constraints of most embedded applications.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124559658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autonomous Vehicle Development Using FPGA for Image Processing","authors":"Hamish Simmonds, Nicholas Carlisle, Xue Li, Fanglin Mu, D. Bailey","doi":"10.1109/ICFPT47387.2019.00090","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00090","url":null,"abstract":"This project outlines the steps taken to implement an autonomously driving car. As software-based image processing can be very slow and consume a lot of power, the algorithms have been implemented on an FPGA. To reduce computation times and complexity, opportunities for simplifying the algorithms are explored. White line detection and saturation thresholding are explored for lane and object detection respectively.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116307439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hybrid Network Utilization for Efficient Communication in a Tightly Coupled FPGA Cluster","authors":"Tomohiro Ueno, Takaaki Miyajima, Antoniette Mondigo, K. Sano","doi":"10.1109/ICFPT47387.2019.00068","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00068","url":null,"abstract":"A tightly coupled FPGA cluster is a promising approach for large-scale parallel processing with application specialized hardware. Along with the advantages of FPGA-based custom computing, such as high power efficiency, a customized network subsystem with efficient communication through direct Inter-FPGA links allows an FPGA cluster to be an effective platform for large-scale parallel processing. However, the cluster can suffer from substantial communication costs when a cluster becomes larger to obtain higher computing performance. In this paper, we propose to exploit the communication capacity of a host server network to improve communication performance. Besides, we show estimations for practical communication patterns on a network model in which we efficiently use both the FPGA and the host networks.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128756886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An OpenCL-Based Hybrid CNN-RNN Inference Accelerator On FPGA","authors":"Yunfei Sun, Brian Liu, Xianchao Xu","doi":"10.1109/ICFPT47387.2019.00048","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00048","url":null,"abstract":"Recently, Convolution Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and CNN-RNN hybrid networks have demonstrated great success in many deep learning scenarios. Although many dedicated FPGA accelerators for a certain kind of network have been proposed, few of them combine CNN and RNN acceleration together. In this paper we propose a high-throughput and resource-efficient CNN-RNN fusion accelerator on FPGA with commercial OpenCL to support general-purpose DNNs. It utilizes a novel streaming architecture and mapping strategy to implement the most computation-intensive and resource-demanding parts in DNNs on the same computation logic. By such a hardware reuse method, it realizes resource efficiency in accelerating CNNs, RNNs and their hybrid networks. Our accelerator follows a layer-by-layer, subgraph-by-subgraph or subnetwork-by-subnetwork execution mode, which facilitates deploying most DNNs flexibly at runtime with the best performance. YOLOv2, LSTM and CRNN are tested with our work on Intel Arria10 GX1150 FPGA. It achieves 646 GOPS throughput on CRNN, which is the best performance on CNN-RNN hybrid networks among high-level-synthesis (HLS) based FPGA accelerators. 
Moreover, its throughput for CNNs and RNNs is competitive with the state-of-the-art specialized FPGA accelerators.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116564538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Machine Learning Approach for Power Gating the FPGA Routing Network","authors":"Zeinab Seifoori, H. Asadi, Mirjana Stojilović","doi":"10.1109/ICFPT47387.2019.00010","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00010","url":null,"abstract":"Power gating is a common approach for reducing circuit static power consumption. In FPGAs, resources that dominate static power consumption lie in the routing network. Researchers have proposed several heuristics for clustering multiplexers in routing network into power-gating regions. In this paper, we propose a fundamentally different approach based on K-means clustering, an algorithm commonly used in machine learning. Experimental results on Titan benchmarks and Stratix-IV FPGA architecture show that our proposed clustering algorithms outperform the state of the art. For example, for 32 power-gating regions in FPGA routing switch matrices, we achieve (on average) almost 1.4× higher savings (37.48% vs. 26.94%) in the static power consumption of the FPGA routing resources at lower area overhead than the most efficient heuristic published so far.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"189 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132891831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}