{"title":"An Efficient Hardware Accelerator for Sparse Convolutional Neural Networks on FPGAs","authors":"Liqiang Lu, Jiaming Xie, Ruirui Huang, Jiansong Zhang, Wei Lin, Yun Liang","doi":"10.1109/FCCM.2019.00013","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00013","url":null,"abstract":"Deep convolutional neural networks (CNN) have achieved remarkable performance with the cost of huge computation. As the CNN model becomes more complex and deeper, compressing CNN to sparse by pruning the redundant connection in networks has emerged as an attractive approach to reduce the amount of computation and memory requirement. In recent years, FPGAs have been demonstrated to be an effective hardware platform to accelerate CNN inference. However, most existing FPGA architectures focus on dense CNN models. The architecture designed for dense CNN models are inefficient when executing sparse models as most of the arithmetic operations involve addition and multiplication with zero operands. On the other hand, recent sparse FPGA accelerators only focus on FC layers. In this work, we aim to develop an FPGA accelerator for sparse CNNs. To efficiently deal with the irregular connection in the sparse convolutional layer, we propose a weight-oriented dataflow that processes each weight individually. Then we design an FPGA architecture which can handle input-weight connection and weight-output connection efficiently. For input-weight connection, we design a tile look-up table to eliminate the runtime indexing match of compressed weights. Moreover, we develop a weight layout to enable high on-chip memory access. To cooperate with the weight layout, a channel multiplexer is inserted to locate the address which can ensure no data access conflict. Experiments demonstrate that our accelerator can achieve 223.4-309.0 GOP/s for the modern CNNs on Xilinx ZCU102, which provides a 3.6x-12.9x speedup over previous dense CNN FPGA accelerators.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122453008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High Throughput and Energy-Efficient Retina-Inspired Tone Mapping Processor","authors":"Lili Liu, Xiaoqiang Xiang, Yuxiang Xie, Yongjie Li, Bo Yan, Jun Zhou","doi":"10.1109/FCCM.2019.00062","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00062","url":null,"abstract":"This paper presents a high throughput and energy-efficient retina inspired tone mapping processor. Several hardware design techniques have been proposed to achieve high throughput and high energy efficiency, including data partition based parallel processing with S-shape sliding, adjacent frame feature sharing, multi-layer convolution pipelining and convolution filter compression with zero skipping convolution. The proposed processor has been implemented on a Xilinx's Virtex7 FPGA for demonstration. It is able to achieve a throughput of 189 frames per second for 1024*768 RGB images with 819 mW. Compared with several state-of-the-art tone mapping processors, the proposed processor achieves higher throughput and energy efficiency. It is suitable for high-speed and energy-constrained video enhancement applications such as autonomous vehicle and drone monitoring.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127004166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Tool and Runtime Support for Fine-Grain Reconfiguration in Highly Flexible Reconfigurable Systems","authors":"Rafael Zamacola, A. García-Martínez, J. Mora, A. Otero, E. D. L. Torre","doi":"10.1109/FCCM.2019.00048","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00048","url":null,"abstract":"Dynamic partial reconfiguration significantly reduces reconfiguration times when offloading a partial design. However, there are occasions when fine-tuning a circuit would greatly benefit from quicker reconfiguration times. To that end, authors present an automated tool and runtime support to reconfigure LUT-based multiplexers and constants. In contrast to conventional multiplexers and constants, it is possible to modify these components without having a direct communication with the static system.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132991746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SimAcc: A Configurable Cycle-Accurate Simulator for Customized Accelerators on CPU-FPGAs SoCs","authors":"Konstantinos Iordanou, Oscar Palomar, John Mawer, Cosmin Gorgovan, A. Nisbet, M. Luján","doi":"10.1109/FCCM.2019.00031","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00031","url":null,"abstract":"This paper describes a flexible infrastructure for fast computer architecture simulation and prototyping of accelerator IP. A trend for System-on-Chips is to include application specific accelerators on the die. However, there is still a key research problem that needs to be addressed: How do hardware accelerators interact with the processors of a system and what is the impact on overall performance? To solve this problem, we propose an infrastructure that can directly simulate unmodified application executables with FPGA hardware accelerators. Unmodified application binaries are dynamically instrumented to generate processor load/store and program counter events and any memory accesses generated by accelerators, that are sent to an FPGA-based out-of-order pipeline model. The key features of our infrastructure are the ability to code exclusively at the user level, to dynamically discover and use available hardware models at run time, to test and simultaneously optimize hardware accelerators in an heterogeneous system. In terms of evaluation, we present a comparison between our system and Gem5 to demonstrate accuracy and relative performance, using the SPEC CPU benchmarks; even though our system is implemented on Zynq XC7045 which integrates dual 667MHz Arm Cortex-A9s with substantial FPGA resources, it outperforms Gem5 running on a Xeon E3 3.2 GHz with 32GBs of RAM. We also evaluate our infrastructure in simulating the interaction of accelerators with processors using accelerators taken from the Mach Benchmark Suite and other custom accelerators from computer vision applications.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128680842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EFCAD — An Embedded FPGA CAD Tool Flow for Enabling On-chip Self-Compilation","authors":"K. Pham, Malte Vesper, Dirk Koch, Eddie Hung","doi":"10.1109/FCCM.2019.00011","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00011","url":null,"abstract":"This paper combines a chain of academic tools to form an FPGA compilation flow for building partially reconfigurable modules on lightweight embedded platforms. Our flow — EFCAD — supports the entire stack from RTL (Verilog) to (partial) bitstream, and we demonstrate early results from the onchip ARM processor of, and targeting, the latest 16nm generation of a Zynq UltraScale+ MPSoC device. With this, we complement Xilinx's PYNQ initiative to not only facilitate System-on-Chip research and education entirely within an embedded system, but also to allow building new and specialising existing customcomputing accelerators without needing access to a workstation.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130747768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Irregular Memory Parallelism in Quasi-Stencils through Nonlinear Transformation","authors":"Juan Escobedo, Mingjie Lin","doi":"10.1109/FCCM.2019.00039","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00039","url":null,"abstract":"Non-stencil kernels with irregular memory accesses pose unique challenges to achieving high computing performance and hardware efficiency in high-level synthesis (HLS) of FPGA. We present a highly versatile and systematic approach to effectively synthesizing a special and important subset of non-stencil computing kernels, quasi-stencils, which possess the mathematical property that, if studied in a particular kind of high-dimensional space corresponding to the prime factorization space, the distance between the memory accesses during each kernel iteration becomes constant and such an irregular non-stencil can be considered as a stencil. This opens the door to exploiting a vast array of existing memory optimization algorithms, such as memory partitioning/banking and data reuse, originally designed for the standard stencil-based kernel computing, therefore offering totally new opportunity to effectively synthesizing irregular non-stencil kernels. We show the feasibility of our approach implementing our methodology in a KC705 Xilinx FPGA board and tested it with several custom code segments that meet the quasi-stencil requirement vs some of the state-of the art methods in memory partitioning. We achieve significant reduction in partition factor, and perhaps more importantly making it proportional to the number of memory accesses instead of depending on the problem size with the cost of some wasted space.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125560401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Monobit Wideband Receiver with Integrated Dithering in FPGA","authors":"Dan Pritsker, Colman Cheung","doi":"10.1109/FCCM.2019.00073","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00073","url":null,"abstract":"This work presents an innovative and very competitive approach to re-purpose FPGA digital high-speed transceivers to sample wideband analog signals while achieving excellent sampling quality. Such solution can achieve 16+GHz instantaneous bandwidth using existing technology in Stratix-V FPGA family","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126913656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FASE: FPGA Acceleration of Secure Function Evaluation","authors":"S. Hussain, F. Koushanfar","doi":"10.1109/FCCM.2019.00045","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00045","url":null,"abstract":"We present FASE, an FPGA accelerator for Secure Function Evaluation (SFE) by employing the well-known cryptographic protocol named Yao's Garbled Circuit (GC). SFE allows two parties to jointly compute a function on their private data and learn the output without revealing their inputs to each other. FASE is designed to allow cloud servers to provide secure services to a large number of clients in parallel while preserving the privacy of the data from both sides. Current SFE accelerators either target specific applications, and therefore are not amenable to generic use, or have low throughput due to inefficient management of resources. In this work, we present a pipelined architecture along with an efficient scheduling scheme to ensure optimal usage of the available resources. The scheme is built around a simulator of the hardware design that schedules the workload and assigns the most suitable task to the encryption cores at each cycle. This, coupled with optimal management of the read and write cycles of the Block RAM on FPGA, results in a minimum 2 orders of magnitude improvement in terms of throughput per core for the reported benchmarks compared to the most recent generic GC accelerator. Moreover, our encryption core requires 17% less resource compared to the most recent secure GC realization.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"271 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115819131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Packet Inspection in FPGAs via Approximate Nondeterministic Automata","authors":"Milan Ceska, Vojtěch Havlena, L. Holík, J. Korenek, Ondřej Lengál, Denis Matousek, J. Matoušek, Jakub Semric, Tomáš Vojnar","doi":"10.1109/FCCM.2019.00025","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00025","url":null,"abstract":"Deep packet inspection via regular expression (RE) matching is a crucial task of network intrusion detection systems (IDSes), which secure Internet connection against attacks and suspicious network traffic. Monitoring high-speed computer networks (100 Gbps and faster) in a single-box solution demands that the RE matching, traditionally based on finite automata (FAs), is accelerated in hardware. In this paper, we describe a novel FPGA architecture for RE matching that is able to process network traffic beyond 100 Gbps. The key idea is to reduce the required FPGA resources by leveraging approximate nondeterministic FAs (NFAs). The NFAs are compiled into a multi-stage architecture starting with the least precise stage with a high throughput and ending with the most precise stage with a low throughput. To obtain the reduced NFAs, we propose new approximate reduction techniques that take into account the profile of the network traffic. Our experiments showed that using our approach, we were able to perform matching of large sets of REs from SNORT, a popular IDS, on unprecedented network speeds.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"323 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129773658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PIR-DSP: An FPGA DSP Block Architecture for Multi-precision Deep Neural Networks","authors":"Seyedramin Rasoulinezhad, Hao Zhou, Lingli Wang, Philip H. W. Leong","doi":"10.1109/FCCM.2019.00015","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00015","url":null,"abstract":"Quantisation is a key optimisation strategy to improve the performance of floating-point deep neural network (DNN) accelerators. Digital signal processing (DSP) blocks on field-programmable gate arrays are not efficiently utilised when the accelerator precision is much lower than the DSP precision. Through three modifications to Xilinx DSP48E2 DSP blocks, we address this issue for important computations in embedded DNN accelerators, namely the standard, depth-wise, and pointwise convolutional layers. First, we propose a flexible precision, run-time decomposable multiplier architecture for CNN implementations. Second, we propose a significant upgrade to DSPDSP interconnect, providing a semi-2D low precision chaining capability which supports our low-precision multiplier. Finally, we improve data reuse via a register file which can also be configured as FIFO. Compared with the 27 × 18-bit mode in the Xilinx DSP48E2, our Precision, Interconnect, and Reuseoptimised DSP (PIR-DSP) offers a 6× improvement in multiplyaccumulate operations per DSP in the 9 × 9-bit case, 12× for 4 × 4 bits, and 24× for 2 × 2 bits. We estimate that PIR-DSP decreases the run time energy to 31/19/13% of the original value in a 9/4/2-bit MobileNet-v2 DNN implementation.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129506334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}