2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM): Latest Articles

RapidRoute: Fast Assembly of Communication Structures for FPGA Overlays
Leo Liu, Jay Weng, Nachiket Kapre
{"title":"RapidRoute: Fast Assembly of Communication Structures for FPGA Overlays","authors":"Leo Liu, Jay Weng, Nachiket Kapre","doi":"10.1109/FCCM.2019.00018","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00018","url":null,"abstract":"We can implement relocatable, bus-based communication structures on Xilinx FPGAs using RapidWright while delivering competitive frequency, single digit speedups in execution time, and orders of magnitude reduction in memory usage over Xilinx Vivado 2017.2. We develop RapidRoute, a custom router that exploits symmetry in placement and routing of bus endpoints, caching of reusable route segments, selective multi-threading of the router engine, and abutment-friendly tiling heuristics. The key idea is to reduce the amount of work necessary to generate these communication structures through the use of search heuristics, parallelism, and reuse. We are able to outperform Vivado router by as much as 8× for topologies ranging from 1D rings, torii, and meshes, while taking 1000× lower memory footprint, and delivering timing with 0.2ns of Vivado. RapidRoute opens the door to building a family of custom routing tools for constructing FPGA overlays for various application domains.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126988043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Exploring the Random Network of Hodgkin and Huxley Neurons with Exponential Synaptic Conductances on OpenCL FPGA Platform
Zheming Jin, H. Finkel
{"title":"Exploring the Random Network of Hodgkin and Huxley Neurons with Exponential Synaptic Conductances on OpenCL FPGA Platform","authors":"Zheming Jin, H. Finkel","doi":"10.1109/FCCM.2019.00057","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00057","url":null,"abstract":"We choose a random network of Hodgkin–Huxley (HH) neurons with exponential synaptic conductance as a study of accelerating the simulation of networks of spiking neurons on an FPGA. Focused on the conductance-based HH (COBAHH) benchmark, we execute the benchmark on a general-purpose simulator for spiking neural networks, identify a computationally intensive kernel in the generated C++ code, convert the kernel to a portable OpenCL kernel, and describe the kernel optimizations which can reduce the resource utilizations and improve the kernel performance. We evaluate the kernel on an Intel Arria 10 based FPGA platform, an Intel Xeon 16-core CPU, and an NVIDIA Tesla P100 GPU. FPGAs are promising for the simulation of spiking neuron network.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128333239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
A Fine-Grained Parallel Snappy Decompressor for FPGAs Using a Relaxed Execution Model
Jian Fang, Jianyu Chen, Jinho Lee, Z. Al-Ars, H. P. Hofstee
{"title":"A Fine-Grained Parallel Snappy Decompressor for FPGAs Using a Relaxed Execution Model","authors":"Jian Fang, Jianyu Chen, Jinho Lee, Z. Al-Ars, H. P. Hofstee","doi":"10.1109/FCCM.2019.00076","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00076","url":null,"abstract":"Snappy is a widely used (de) compression algorithm in many big data applications. Such a data compression technique has been proven to be successful to save storage space and to reduce the amount of data transmission from/to storage devices. In this paper, we present a fine-grained parallel Snappy decompressor on FPGAs running under a relaxed execution model that addresses the following main challenges in existing solutions. First, existing designs either can only process one token per cycle or can process multiple tokens per cycle with low area efficiency and/or low clock frequency. Second, the high read-after-write data dependency during decompression introduces stalls which pull down the throughput.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125488108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Towards Prototyping and Acceleration of Java Programs onto Intel FPGAs
Michail Papadimitriou, J. Fumero, Athanasios Stratikopoulos, Christos Kotselidis
{"title":"Towards Prototyping and Acceleration of Java Programs onto Intel FPGAs","authors":"Michail Papadimitriou, J. Fumero, Athanasios Stratikopoulos, Christos Kotselidis","doi":"10.1109/FCCM.2019.00051","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00051","url":null,"abstract":"In this work, we propose an approach for transparent compilation and execution of Java programs onto Intel FPGA devices. In detail, we showcase how a managed runtime environment can leverage Intel OpenCL SDK to generate specialized FPGA code, enabling prototyping and acceleration of Java Programs onto FPGAs. Finally, we describe our implementation in the context of TornadoVM with a clear objective to ease FPGA programmability allowing integration with existing frameworks.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122142512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
An FPGA-Based BWT Accelerator for Bzip2 Data Compression
W. Qiao, Zhenman Fang, Mau-Chung Frank Chang, J. Cong
{"title":"An FPGA-Based BWT Accelerator for Bzip2 Data Compression","authors":"W. Qiao, Zhenman Fang, Mau-Chung Frank Chang, J. Cong","doi":"10.1109/FCCM.2019.00023","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00023","url":null,"abstract":"The Burrows-Wheeler Transform (BWT) has played an important role in lossless data compression algorithms. To achieve a good compression ratio, the BWT block size needs to be several hundreds of kilobytes, which requires a large amount of on-chip memory resources and limits effective hardware implementations. In this paper, we analyze the bottleneck of the BWT acceleration and present a novel design to map the anti-sequential suffix sorting algorithm to FPGAs. Our design can perform BWT with a block size of up to 500KB (i.e., bzip2 level 5 compression) on the Xilinx Virtex UltraScale+ VCU1525 board, while the state-of-art FPGA implementation can only support 4KB block size. Experiments show our FPGA design can achieve ~2x speedup compared to the best CPU implementation using standard large Corpus benchmarks.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122172028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
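For readers unfamiliar with the transform named in the entry above, here is a minimal software sketch of the Burrows-Wheeler Transform. It uses the textbook "sort all rotations" formulation, which is an assumption made for illustration only; it is not the suffix sorting design the paper maps to FPGAs. The function name `bwt` and the `$` sentinel are hypothetical choices.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler Transform via sorted rotations.

    Illustrative only: sorting whole rotations is far slower than the
    suffix-sorting approaches used in real compressors.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # Build every cyclic rotation of the sentinel-terminated string.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    print(bwt("banana"))
```

The block-size figures in the abstract (hundreds of kilobytes) correspond to the length of `text` here: a larger block gives the sort more context to group similar characters, at the cost of more memory.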
Rethinking Integer Divider Design for FPGA-Based Soft-Processors
Eric Matthews, Alec Lu, Zhenman Fang, Lesley Shannon
{"title":"Rethinking Integer Divider Design for FPGA-Based Soft-Processors","authors":"Eric Matthews, Alec Lu, Zhenman Fang, Lesley Shannon","doi":"10.1145/3502492","DOIUrl":"https://doi.org/10.1145/3502492","url":null,"abstract":"Most existing soft-processors on FPGAs today support a fixed-latency instruction pipeline. Therefore, for integer division, a simple fixed-latency radix-2 integer divider is typically used, or algorithm-level changes are made to avoid integer divisions. However, for certain important application domains the simple radix-2 integer divider becomes the performance bottleneck, as every 32-bit division operation takes 32 cycles. In this paper, we explore integer divider designs for FPGA-based soft-processors, by leveraging the recent support of variable-latency execution units in their instruction pipeline. We implement a high-performance, data-dependent, variable-latency integer divider called Quick-Div, optimize its performance on FPGAs, and integrate it into a RISC-V soft-processor called Taiga that supports a variable-latency instruction pipeline. We perform a comprehensive analysis and comparison, in terms of cycles, clock frequency, and resource usage, for both the fixed-latency radix-2/4/8/16 dividers and our variable-latency Quick-Div divider with various optimizations. Experimental results on a Xilinx Virtex UltraScale+ VCU118 FPGA board show that our Quick-Div divider can provide over 5x better performance and over 4x better performance/LUT compared to a radix-2 divider for certain applications like random number generation. Finally, through a case study of integer square root, we demonstrate that our Quick-Div divider provides opportunities for reconsidering simpler and faster algorithmic choices.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122992551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
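The abstract above notes that a fixed-latency radix-2 divider spends 32 cycles on every 32-bit division. The sketch below is a software model of a classic unsigned restoring divider, assumed here as a stand-in for the baseline, that shows where that figure comes from: one quotient bit is resolved per iteration regardless of the operand values. The function name `radix2_divide` and the returned cycle counter are hypothetical; this does not model Quick-Div.

```python
def radix2_divide(dividend: int, divisor: int, width: int = 32):
    """Unsigned restoring division that resolves one quotient bit per step.

    Returns (quotient, remainder, cycles); cycles always equals width,
    which is the fixed latency of a radix-2 divider.
    """
    mask = (1 << width) - 1
    dividend &= mask
    divisor &= mask
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    quotient = 0
    remainder = 0
    cycles = 0
    for bit in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Trial subtraction: keep it only if the remainder stays non-negative.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
        cycles += 1
    return quotient, remainder, cycles


if __name__ == "__main__":
    print(radix2_divide(100, 7))
```

A data-dependent, variable-latency design can finish early when the operands are small, which is the opportunity the abstract describes; the loop above has no such shortcut.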
An OpenCL-Based Acceleration for Canny Algorithm Using a Heterogeneous CPU-FPGA Platform
Samah Rahamneh, L. Sawalha
{"title":"An OpenCL-Based Acceleration for Canny Algorithm Using a Heterogeneous CPU-FPGA Platform","authors":"Samah Rahamneh, L. Sawalha","doi":"10.1109/FCCM.2019.00063","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00063","url":null,"abstract":"Field programmable gate arrays (FPGAs) provide both performance and power benefits to heterogeneous systems. In this work, we used a closely-coupled CPU-FPGA heterogeneous system to accelerate Canny edge detector algorithm and compared the performance of the hybrid implementation with that of the optimized separate CPU and FPGA implementations. Our results show up to 4.8X speedup for the hybrid implementation over the CPU only implementation and up to 2.1X over the FPGA only implementation.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124540507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Automated Design Space Exploration and Roofline Analysis for FPGA-Based HLS Applications
Marco Siracusa, Marco Rabozzi, Emanuele Del Sozzo, M. Santambrogio, Lorenzo Di Tucci
{"title":"Automated Design Space Exploration and Roofline Analysis for FPGA-Based HLS Applications","authors":"Marco Siracusa, Marco Rabozzi, Emanuele Del Sozzo, M. Santambrogio, Lorenzo Di Tucci","doi":"10.1109/FCCM.2019.00055","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00055","url":null,"abstract":"The growing interest in FPGA-based solutions for accelerating compute demanding algorithms is pushing the need for new tools and methods to improve productivity. In this work, we propose a methodology to support designers in generating optimal FPGA hardware implementations using High-Level Synthesis (HLS). First, we propose an automated roofline model generation that operates directly on a C/C++ description of the algorithm. The approach enables fast evaluation of the operational intensity of the target function and visualizes the main bottlenecks of the current HLS implementation, providing guidance on how to improve it. Second, we integrate it with a Design Space Exploration (DSE) methodology for quickly evaluating different HLS directives to identify an optimal implementation.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124212222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
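As background for the roofline analysis mentioned in the entry above, the standard roofline bound says attainable performance is the smaller of the compute peak and memory bandwidth times operational intensity. The sketch below evaluates that bound; the function names and all numbers are hypothetical placeholders, not values from the paper or its tool.

```python
def attainable_gflops(peak_gflops: float,
                      bandwidth_gb_per_s: float,
                      operational_intensity: float) -> float:
    """Roofline bound: min(compute peak, bandwidth * FLOPs-per-byte)."""
    return min(peak_gflops, bandwidth_gb_per_s * operational_intensity)


def ridge_point(peak_gflops: float, bandwidth_gb_per_s: float) -> float:
    """Operational intensity above which a kernel becomes compute bound."""
    return peak_gflops / bandwidth_gb_per_s


if __name__ == "__main__":
    # Hypothetical platform: 100 GFLOP/s peak, 10 GB/s memory bandwidth.
    for intensity in (0.5, 2.0, 10.0, 50.0):
        print(intensity, attainable_gflops(100.0, 10.0, intensity))
```

A kernel whose operational intensity falls left of the ridge point is memory bound, so directives that only add compute parallelism would not be expected to help it.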
LUTNet: Rethinking Inference in FPGA Soft Logic
Erwei Wang, James J. Davis, P. Cheung, G. Constantinides
{"title":"LUTNet: Rethinking Inference in FPGA Soft Logic","authors":"Erwei Wang, James J. Davis, P. Cheung, G. Constantinides","doi":"10.1109/FCCM.2019.00014","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00014","url":null,"abstract":"Research has shown that deep neural networks contain significant redundancy, and that high classification accuracies can be achieved even when weights and activations are quantised down to binary values. Network binarisation on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference operators. We demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable accuracy. Against the state-of-the-art binarised neural network implementation, we achieve twice the area efficiency for several standard network models when inferencing popular datasets. We also demonstrate that even greater energy efficiency improvements are obtainable.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124242525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 49
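The abstract above refers to binarised networks replacing multipliers with XNOR gates. A minimal sketch of that baseline idea follows: with weights and activations restricted to +1/-1 and encoded as bits (1 for +1, 0 for -1), a dot product reduces to an XNOR followed by a population count. The function name `binary_dot` and the bit encoding are assumptions for illustration; this shows the XNOR baseline, not LUTNet's K-input LUT operators.

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n vectors over {+1, -1}.

    Each vector is packed into an integer, bit value 1 meaning +1 and
    bit value 0 meaning -1.
    """
    mask = (1 << n) - 1
    # XNOR is 1 exactly where the two elements agree (product = +1).
    agreements = bin(~(a_bits ^ b_bits) & mask).count("1")
    # Agreements contribute +1 each, disagreements -1 each.
    return 2 * agreements - n


if __name__ == "__main__":
    print(binary_dot(0b1011, 0b1101, 4))
```

Each XNOR takes two inputs, while a K-LUT can realise any Boolean function of K inputs, which is the unused flexibility the abstract points to.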
Wire-Speed Multirate Accelerator for Aggregation Operations on Sorted Data
S. Jun, A. Arvind
{"title":"Wire-Speed Multirate Accelerator for Aggregation Operations on Sorted Data","authors":"S. Jun, A. Arvind","doi":"10.1109/FCCM.2019.00065","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00065","url":null,"abstract":"We present an accelerator architecture for wire-speed aggregation of sorted key-value pairs on a wide datapath, in a bump-in-the-wire fashion. The presented accelerator is capable of maintaining wire-speed regardless of data distribution, even when (1) the aggregation function has multiple-cycle latency, and (2) the input stream is multi-rate, i.e., multiple elements arrive every cycle. To the best of our knowledge, it is the first accelerator architecture that satisfies both properties.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"2014 27","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120969916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
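To make the operation in the entry above concrete, here is a small software analogue of aggregating a key-sorted stream of key-value pairs: because equal keys are adjacent, each run can be reduced in a single pass. The function name `aggregate_sorted` and the default sum aggregation are hypothetical; the sketch says nothing about how the accelerator sustains wire speed with multi-cycle aggregation functions or multiple elements per cycle.

```python
from functools import reduce
from itertools import groupby
from operator import add, itemgetter


def aggregate_sorted(pairs, combine=add):
    """Yield (key, aggregate) for each run of equal keys.

    Assumes the input is already sorted by key, so a single pass over
    adjacent elements is enough.
    """
    for key, run in groupby(pairs, key=itemgetter(0)):
        yield key, reduce(combine, (value for _, value in run))


if __name__ == "__main__":
    stream = [("a", 1), ("a", 2), ("b", 5), ("c", 1), ("c", 1)]
    print(list(aggregate_sorted(stream)))
```

Passing a different `combine` function (for example `max`) changes the aggregation without touching the streaming loop.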