{"title":"RapidRoute: Fast Assembly of Communication Structures for FPGA Overlays","authors":"Leo Liu, Jay Weng, Nachiket Kapre","doi":"10.1109/FCCM.2019.00018","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00018","url":null,"abstract":"We can implement relocatable, bus-based communication structures on Xilinx FPGAs using RapidWright while delivering competitive frequency, single-digit speedups in execution time, and orders-of-magnitude reductions in memory usage over Xilinx Vivado 2017.2. We develop RapidRoute, a custom router that exploits symmetry in placement and routing of bus endpoints, caching of reusable route segments, selective multi-threading of the router engine, and abutment-friendly tiling heuristics. The key idea is to reduce the amount of work necessary to generate these communication structures through the use of search heuristics, parallelism, and reuse. We are able to outperform the Vivado router by as much as 8× for topologies including 1D rings, tori, and meshes, while using a 1000× smaller memory footprint and delivering timing within 0.2ns of Vivado. RapidRoute opens the door to building a family of custom routing tools for constructing FPGA overlays for various application domains.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126988043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the Random Network of Hodgkin and Huxley Neurons with Exponential Synaptic Conductances on OpenCL FPGA Platform","authors":"Zheming Jin, H. Finkel","doi":"10.1109/FCCM.2019.00057","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00057","url":null,"abstract":"We choose a random network of Hodgkin–Huxley (HH) neurons with exponential synaptic conductance as a case study in accelerating the simulation of networks of spiking neurons on an FPGA. Focusing on the conductance-based HH (COBAHH) benchmark, we execute the benchmark on a general-purpose simulator for spiking neural networks, identify a computationally intensive kernel in the generated C++ code, convert the kernel to a portable OpenCL kernel, and describe the kernel optimizations that can reduce resource utilization and improve kernel performance. We evaluate the kernel on an Intel Arria 10 based FPGA platform, an Intel Xeon 16-core CPU, and an NVIDIA Tesla P100 GPU. FPGAs are promising for the simulation of spiking neuron networks.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128333239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fine-Grained Parallel Snappy Decompressor for FPGAs Using a Relaxed Execution Model","authors":"Jian Fang, Jianyu Chen, Jinho Lee, Z. Al-Ars, H. P. Hofstee","doi":"10.1109/FCCM.2019.00076","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00076","url":null,"abstract":"Snappy is a widely used (de)compression algorithm in many big data applications. Such a data compression technique has proven successful at saving storage space and reducing the amount of data transmitted from/to storage devices. In this paper, we present a fine-grained parallel Snappy decompressor on FPGAs running under a relaxed execution model that addresses the following main challenges in existing solutions. First, existing designs either can only process one token per cycle or can process multiple tokens per cycle with low area efficiency and/or low clock frequency. Second, the high read-after-write data dependency during decompression introduces stalls that reduce throughput.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125488108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Prototyping and Acceleration of Java Programs onto Intel FPGAs","authors":"Michail Papadimitriou, J. Fumero, Athanasios Stratikopoulos, Christos Kotselidis","doi":"10.1109/FCCM.2019.00051","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00051","url":null,"abstract":"In this work, we propose an approach for transparent compilation and execution of Java programs onto Intel FPGA devices. In detail, we showcase how a managed runtime environment can leverage Intel OpenCL SDK to generate specialized FPGA code, enabling prototyping and acceleration of Java Programs onto FPGAs. Finally, we describe our implementation in the context of TornadoVM with a clear objective to ease FPGA programmability allowing integration with existing frameworks.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122142512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA-Based BWT Accelerator for Bzip2 Data Compression","authors":"W. Qiao, Zhenman Fang, Mau-Chung Frank Chang, J. Cong","doi":"10.1109/FCCM.2019.00023","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00023","url":null,"abstract":"The Burrows-Wheeler Transform (BWT) has played an important role in lossless data compression algorithms. To achieve a good compression ratio, the BWT block size needs to be several hundred kilobytes, which requires a large amount of on-chip memory resources and limits effective hardware implementations. In this paper, we analyze the bottleneck of BWT acceleration and present a novel design to map the anti-sequential suffix sorting algorithm to FPGAs. Our design can perform BWT with a block size of up to 500KB (i.e., bzip2 level 5 compression) on the Xilinx Virtex UltraScale+ VCU1525 board, while the state-of-the-art FPGA implementation supports only a 4KB block size. Experiments show our FPGA design can achieve ~2x speedup compared to the best CPU implementation using standard large Corpus benchmarks.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122172028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking Integer Divider Design for FPGA-Based Soft-Processors","authors":"Eric Matthews, Alec Lu, Zhenman Fang, Lesley Shannon","doi":"10.1145/3502492","DOIUrl":"https://doi.org/10.1145/3502492","url":null,"abstract":"Most existing soft-processors on FPGAs today support a fixed-latency instruction pipeline. Therefore, for integer division, a simple fixed-latency radix-2 integer divider is typically used, or algorithm-level changes are made to avoid integer divisions. However, for certain important application domains the simple radix-2 integer divider becomes the performance bottleneck, as every 32-bit division operation takes 32 cycles. In this paper, we explore integer divider designs for FPGA-based soft-processors, by leveraging the recent support of variable-latency execution units in their instruction pipeline. We implement a high-performance, data-dependent, variable-latency integer divider called Quick-Div, optimize its performance on FPGAs, and integrate it into a RISC-V soft-processor called Taiga that supports a variable-latency instruction pipeline. We perform a comprehensive analysis and comparison (in terms of cycles, clock frequency, and resource usage) for both the fixed-latency radix-2/4/8/16 dividers and our variable-latency Quick-Div divider with various optimizations. Experimental results on a Xilinx Virtex UltraScale+ VCU118 FPGA board show that our Quick-Div divider can provide over 5x better performance and over 4x better performance/LUT compared to a radix-2 divider for certain applications like random number generation. Finally, through a case study of integer square root, we demonstrate that our Quick-Div divider provides opportunities for reconsidering simpler and faster algorithmic choices.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122992551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An OpenCL-Based Acceleration for Canny Algorithm Using a Heterogeneous CPU-FPGA Platform","authors":"Samah Rahamneh, L. Sawalha","doi":"10.1109/FCCM.2019.00063","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00063","url":null,"abstract":"Field programmable gate arrays (FPGAs) provide both performance and power benefits to heterogeneous systems. In this work, we used a closely-coupled CPU-FPGA heterogeneous system to accelerate the Canny edge detector algorithm and compared the performance of the hybrid implementation with that of the optimized separate CPU and FPGA implementations. Our results show up to 4.8X speedup for the hybrid implementation over the CPU-only implementation and up to 2.1X over the FPGA-only implementation.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124540507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Design Space Exploration and Roofline Analysis for FPGA-Based HLS Applications","authors":"Marco Siracusa, Marco Rabozzi, Emanuele Del Sozzo, M. Santambrogio, Lorenzo Di Tucci","doi":"10.1109/FCCM.2019.00055","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00055","url":null,"abstract":"The growing interest in FPGA-based solutions for accelerating compute-demanding algorithms is pushing the need for new tools and methods to improve productivity. In this work, we propose a methodology to support designers in generating optimal FPGA hardware implementations using High-Level Synthesis (HLS). First, we propose automated roofline model generation that operates directly on a C/C++ description of the algorithm. The approach enables fast evaluation of the operational intensity of the target function and visualizes the main bottlenecks of the current HLS implementation, providing guidance on how to improve it. Second, we integrate it with a Design Space Exploration (DSE) methodology for quickly evaluating different HLS directives to identify an optimal implementation.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124212222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LUTNet: Rethinking Inference in FPGA Soft Logic","authors":"Erwei Wang, James J. Davis, P. Cheung, G. Constantinides","doi":"10.1109/FCCM.2019.00014","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00014","url":null,"abstract":"Research has shown that deep neural networks contain significant redundancy, and that high classification accuracies can be achieved even when weights and activations are quantised down to binary values. Network binarisation on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarised neural network implementation, we achieve twice the area efficiency for several standard network models when inferencing popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124242525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Wire-Speed Multirate Accelerator for Aggregation Operations on Sorted Data","authors":"S. Jun, A. Arvind","doi":"10.1109/FCCM.2019.00065","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00065","url":null,"abstract":"We present an accelerator architecture for wire-speed aggregation of sorted key-value pairs on a wide datapath, in a bump-in-the-wire fashion. The presented accelerator is capable of maintaining wire-speed regardless of data distribution, even when (1) the aggregation function has multiple-cycle latency, and (2) the input stream is multi-rate, i.e., multiple elements arrive every cycle. To the best of our knowledge, it is the first accelerator architecture that satisfies both properties.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"2014 27","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120969916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}