2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM): Latest Publications

Centaur: A Framework for Hybrid CPU-FPGA Databases
Muhsen Owaida, David Sidler, Kaan Kara, G. Alonso
{"title":"Centaur: A Framework for Hybrid CPU-FPGA Databases","authors":"Muhsen Owaida, David Sidler, Kaan Kara, G. Alonso","doi":"10.1109/FCCM.2017.37","DOIUrl":"https://doi.org/10.1109/FCCM.2017.37","url":null,"abstract":"Accelerating relational databases in general and SQL in particular has become an important topic given thechallenges arising from large data collections and increasinglycomplex workloads. Most existing work, however, has beenfocused on either accelerating a single operator (e.g., a join) orin data reduction along the data path (e.g., from disk to CPU). In this paper we focus instead on the system aspects of accelerating a relational engine in hybrid CPU-FPGA architectures. In particular, we present Centaur, a framework running on theFPGA that allows the dynamic allocation of FPGA operatorsto query plans, pipelining these operators among themselveswhen needed, and the hybrid execution of operator pipelinesrunning on the CPU and the FPGA. Centaur is fully compatiblewith relational engines as we demonstrate through its seamlessintegration with MonetDB, a popular column store database. Inthe paper, we describe how this integration is achieved, andempirically demonstrate the advantages of such an approach. The main contribution of the paper is to provide a realisticsolution for accelerating SQL that is compatible with existingdatabase architectures, thereby opening up the possibilities forfurther exploration of FPGA based data processing.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128631531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 71
A High-Level Synthesis Approach Optimizing Accumulations in Floating-Point Programs Using Custom Formats and Operators
Yohann Uguen, F. D. Dinechin, Steven Derrien
{"title":"A High-Level Synthesis Approach Optimizing Accumulations in Floating-Point Programs Using Custom Formats and Operators","authors":"Yohann Uguen, F. D. Dinechin, Steven Derrien","doi":"10.1109/FCCM.2017.41","DOIUrl":"https://doi.org/10.1109/FCCM.2017.41","url":null,"abstract":"Many case studies have demonstrated the potential of Field-Programmable Gate Arrays (FPGAs) as accelerators for a wide range of applications. FPGAs offer massive parallelism and programmability at the bit level. This enables programmers to exploit a range of techniques that avoid many bottlenecks of classical von Neumann computing. However, development costs for FPGAs are orders of magnitude higher than classical programming. A solution would be the use of High-Level Synthesis (HLS) tools, which use C as a hardware description language. However, the C language was designed to be executed on general purpose processors, not to generate hardware. Its datatypes and operators are limited to a small number (more or less matching the hardware operators present in mainstream processors), and HLS tools inherit these limitations. To better exploit the freedom offered by hardware and FPGAs, HLS vendors have enriched the C language with integer and fixed-point types of arbitrary size. Still, the operations on these types remain limited to the basic arithmetic and logic ones. In floating point, the current situation is even worse. The operator set is limited, and the sizes are restricted to 32 and 64 bits. Besides, most recent compilers, including the HLS ones, attempt to follow established standards, in particular C11 and IEEE-754. This ensures bit-exact compatibility with software, but greatly reduces the freedom of optimization by the compiler. For instance, a floating point addition is not associative even though its real equivalent is. In the present work we attempt to give the compiler more freedom. For this, we sacrifice the strict respect of the IEEE-754 and C11 standards, but we replace it with the strict respect of a high-level accuracy specification expressed by the programmer through a pragma. The case study in this work is a program transformation that applies to floating-point additions on a loop's critical path. It decomposes them into elementary steps, resizes the corresponding subcomponents to guarantee some user-specified accuracy, and merges and reorders these components to improve performance. The result of this complex sequence of optimizations could not be obtained from an operator generator, since it involves global loop information. For this purpose, we used a compilation flow involving one or several source-to-source transformations operating on the code given to HLS tools (Figure 1).The proposed transformation already works very well on 3 of the 10 FPMarks where it improves both latency and accuracy by an order of magnitude for comparable area. For 2 more benchmarks, the latency is not improved (but not degraded either) due to current limitations of HLS tools. This defines short-term future work. The main result of this work is that HLS tools also have the potential to generate efficient designs for handling floating-point computations in a completely non-standard way. 
In the longer term, we believe that HLS flows can not only import","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"668 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115750869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
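The abstract's key premise is that IEEE-754 floating-point addition is not associative, so a standards-faithful compiler may not reorder an accumulation. The self-contained snippet below (arbitrary values, plain C++) demonstrates that premise; it is not the paper's transformation, which instead rebuilds the accumulation around a custom-sized accumulator under a user-supplied accuracy pragma.

```cpp
// Minimal demonstration that float addition is order-dependent.
#include <cstdio>

int main() {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float left  = (a + b) + c;   // the large terms cancel first, leaving 1
    float right = a + (b + c);   // c is absorbed into b (below one ulp), leaving 0
    std::printf("(a+b)+c = %g, a+(b+c) = %g\n", left, right);
    // Because the two orderings give different results, a compiler bound to
    // bit-exact IEEE-754/C11 behaviour cannot restructure the sum, which is
    // exactly the freedom the paper buys back via an accuracy specification.
    return 0;
}
```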
Citations: 2
A Nanosecond-Level Hybrid Table Design for Financial Market Data Generators
H. Fu, Conghui He, W. Luk, Weijia Li, Guangwen Yang
{"title":"A Nanosecond–Level Hybrid Table Design for Financial Market Data Generators","authors":"H. Fu, Conghui He, W. Luk, Weijia Li, Guangwen Yang","doi":"10.1109/FCCM.2017.30","DOIUrl":"https://doi.org/10.1109/FCCM.2017.30","url":null,"abstract":"This paper proposes a hybrid sorted table design for minimizing electronic trading latency, with three main contributions. First, a hierarchical sorted table with two levels, a fast cache table in reconfigurable hardware storing megabytes of data items and a master table in software storing gigabytes of data items. Second, a full set of operations, including insertion, deletion, selection and sorting, for the hybrid table with latency in a few cycles. Third, an on-demand synchronization scheme between the cache table and the master table. An implementation has been developed that targets an FPGA-based network card in the environment of the China Financial Futures Exchange (CFFEX) which sustains 1-10Gb/s bandwidth with latency of 400 to 700 nanoseconds, providing an 80- to 125-fold latency reduction compared to a fully optimized CPU-based solution, and a 2.2-fold reduction over an existing FPGA-based solution.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133919522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Applying the Flask Security Architecture to Secure SoC Design
Festus Hategekimana, C. Bobda
{"title":"Applying the Flask Security Architecture to Secure SoC Design","authors":"Festus Hategekimana, C. Bobda","doi":"10.1109/FCCM.2017.28","DOIUrl":"https://doi.org/10.1109/FCCM.2017.28","url":null,"abstract":"We explore a reference monitor (RM) design which borrows from the Flask security architecture. Our RM design goal is to achieve complete mediation by checking and verifying the authority and authenticity of every access to every system object. Access decisions are administered by a security logic server implemented as an extension of the peripheral bus. Initial results show a minimal increase in resource overhead and no significant impact on the performance.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125538406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
An FPGA Design Framework for CNN Sparsification and Acceleration
Sicheng Li, W. Wen, Yu Wang, Song Han, Yiran Chen, Hai Helen Li
{"title":"An FPGA Design Framework for CNN Sparsification and Acceleration","authors":"Sicheng Li, W. Wen, Yu Wang, Song Han, Yiran Chen, Hai Helen Li","doi":"10.1109/FCCM.2017.21","DOIUrl":"https://doi.org/10.1109/FCCM.2017.21","url":null,"abstract":"Convolutional neural networks (CNNs) have recently broken many performance records in image recognition and object detection problems. The success of CNNs, to a great extent, is enabled by the fast scaling-up of the networks that learn from a huge volume of data. The deployment of big CNN models can be both computation-intensive and memory-intensive, leaving severe challenges to hardware implementations. In recent years, sparsification techniques that prune redundant connections in the networks while still retaining the similar accuracy emerge as promising solutions to alliterate the computation overheads associated with CNNs [1]. However, imposing sparsity in CNNs usually generates random network connections and thus, the irregular data access pattern results in poor data locality. The low computation efficiency of the sparse networks, which is caused by the incurred unbalance in computing resource consumption and low memory bandwidth usage, significantly offsets the theocratical reduction of the computation complexity and limits the execution scalability of CNNs on general- purpose architectures [2]. For instance, as an important computation kernel in CNNs – the sparse convoluation, is usually accelerated by using data compression schemes where only nonzero elements of the kernel weights are stored and sent to multiplication-accumulation computations (MACs) at runtime. However, the relevant executions on CPUs and GPUs reach only 0.1% to 10% of the system peak performance even designated software libraries are applied (e.g., MKL library for CPUs and cuSPARSE library for GPUs). Field programmable gate arrays (FPGAs) have been also extensively studied as an important hardware platform for CNN computations [3]. Different from general-purpose architectures, FPGA allows users to customize the functions and organization of the designed hardware in order to adapt various resource needs and data usage patterns. This characteristic, as we identified in this work, can be leveraged to effectively overcome the main challenges in the execution of sparse CNNs through close coordinations between software and hardware. In particular, the reconfigurability of FPGA helps to 1) better map the sparse CNN onto the hardware for improving computation parallelism and execution efficiency and 2) eliminate the computation cost associated with zero weights and enhance data reuse to alleviate the adverse impacts of the irregular data accesses. In this work, we propose a hardware-software co-design framework to address the above challenges in sparse CNN accelerations. First, we introduce a data locality-aware sparsification scheme that optimizes the structure of the sparse CNN during training phase to make it friendly for hardware mapping. Both memory allocation and data access regularization are considered in the optimization process. 
Second, we develop a distributed architecture composed of the customized processing elements (PEs) that enables high computation parallelism","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116853219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
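The compressed-weight MAC pattern mentioned above (store only the nonzero weights plus their indices, multiply them against the matching activations) can be sketched in a few lines of plain C++. The data and the value/index layout are invented; the paper's contribution is the locality-aware sparsification and the distributed PE architecture, not this scalar loop.

```cpp
// Hypothetical sparse MAC: iterate over the nonzero kernel weights only.
#include <cstddef>
#include <cstdio>
#include <vector>

struct SparseKernel {
    std::vector<float> values;   // nonzero weights
    std::vector<int>   indices;  // position of each nonzero weight in the dense kernel
};

float sparse_mac(const SparseKernel& k, const std::vector<float>& activations) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < k.values.size(); ++i)
        acc += k.values[i] * activations[k.indices[i]];  // zero weights never reach the MAC
    return acc;
}

int main() {
    // Dense kernel {0, 0.5, 0, 0, -1.0, 0, 0, 2.0, 0} stored in compressed form:
    SparseKernel k{{0.5f, -1.0f, 2.0f}, {1, 4, 7}};
    std::vector<float> act = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::printf("output = %g\n", sparse_mac(k, act));  // 0.5*2 - 1*5 + 2*8 = 12
    return 0;
}
```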
Citations: 21
Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs
Liqiang Lu, Yun Liang, Qingcheng Xiao, Shengen Yan
{"title":"Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs","authors":"Liqiang Lu, Yun Liang, Qingcheng Xiao, Shengen Yan","doi":"10.1109/FCCM.2017.64","DOIUrl":"https://doi.org/10.1109/FCCM.2017.64","url":null,"abstract":"In recent years, Convolutional Neural Networks (CNNs) have become widely adopted for computer vision tasks. FPGAs have been adequately explored as a promising hardware accelerator for CNNs due to its high performance, energy efficiency, and reconfigurability. However, prior FPGA solutions based on the conventional convolutional algorithm is often bounded by the computational capability of FPGAs (e.g., the number of DSPs). In this paper, we demonstrate that fast Winograd algorithm can dramatically reduce the arithmetic complexity, and improve the performance of CNNs on FPGAs. We first propose a novel architecture for implementing Winograd algorithm on FPGAs. Our design employs line buffer structure to effectively reuse the feature map data among different tiles. We also effectively pipeline the Winograd PE engine and initiate multiple PEs through parallelization. Meanwhile, there exists a complex design space to explore. We propose an analytical model to predict the resource usage and reason about the performance. Then, we use the model to guide a fast design space exploration. Experiments using the state-of-the-art CNNs demonstrate the best performance and energy efficiency on FPGAs. We achieve an average 1006.4 GOP/s for the convolutional layers and 854.6 GOP/s for the overall AlexNet and an average 3044.7 GOP/s for the convolutional layers and 2940.7 GOP/s for the overall VGG16 on Xilinx ZCU102 platform.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123682847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 254
Improved Synthesis of Compressor Trees on FPGAs in High-Level Synthesis
Le Tu, Yuelai Yuan, Kan Huang, Xiaoqiang Zhang, Zixin Wang, Dihu Chen
{"title":"Improved Synthesis of Compressor Trees on FPGAs in High-Level Synthesis","authors":"Le Tu, Yuelai Yuan, Kan Huang, Xiaoqiang Zhang, Zixin Wang, Dihu Chen","doi":"10.1109/FCCM.2017.11","DOIUrl":"https://doi.org/10.1109/FCCM.2017.11","url":null,"abstract":"In this paper, an approach to synthesize compressor trees in High-level Synthesis (HLS) for FPGAs is proposed. Our approach utilizes the bit-level information to improve the compressor tree synthesis. To obtain the bit-level information targeting compressor tree synthesis, a modified bitmask analysis technique based on prior work is proposed. A series of experimental results show that, compared to the existing heuristic, the average reductions of area and delay are 22.96% and 7.05%. The reductions increase to 29.97% and 9.07% respectively, when the carry chains in FPGAs are utilized to implement the compressor trees.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128224555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HLScope: High-Level Performance Debugging for FPGA Designs
Young-kyu Choi, J. Cong
{"title":"HLScope: High-Level Performance Debugging for FPGA Designs","authors":"Young-kyu Choi, J. Cong","doi":"10.1109/FCCM.2017.44","DOIUrl":"https://doi.org/10.1109/FCCM.2017.44","url":null,"abstract":"In their quest for further optimization, field-programmable gate array (FPGA) designers often spend considerable time trying to identify the performance bottleneck in a current design. But since FPGAs do not have built-in high-level probes for performance analysis, manual effort is required to insert custom hardware monitors. This, however, is a time-consuming process which calls for automation. Previous work automates the process of inserting hardware monitors into the communication channels or the finite-state machine, but the instrumentation is applied in low-level hardware description languages (HDL) which limits the comprehensibility in identifying the root cause of stalls. Instead, we propose a performance debugging methodology based on high-level synthesis (HLS). High-level analysis allows tracing the cause of stalls on a function or loop level, which provides a more intuitive feedback that can be used to pinpoint the performance bottleneck. In this paper we propose HLScope, a source-to-source transformation framework based on Vivado HLS for automated performance analysis. We present a method for analyzing the information collected from the software simulation to estimate the stall rate and its cause without the need for FPGA bitstream generation. For detailed analysis, an in-FPGA analysis method is proposed that can be natively integrated into the HLS environment. Experiments show that the parameter extraction from the simulation process is orders of magnitude faster than bitstream generation, with a 2.2% cycle difference on average. In-FPGA flow consumes only about 170 LUTs and a BRAM per monitored module and provides cycle-accurate results.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130658314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Exploration of FPGA-Based Packet Switches for Rack-Scale Computers on a Board
J. H. Han, N. M. Bojan, A. Moore
{"title":"Exploration of FPGA-Based Packet Switches for Rack-Scale Computers on a Board","authors":"J. H. Han, N. M. Bojan, A. Moore","doi":"10.1109/FCCM.2017.35","DOIUrl":"https://doi.org/10.1109/FCCM.2017.35","url":null,"abstract":"This work explores the design space (bandwidthand port configuration) for an FPGA-based top-of-rack switchand, use our implementation, to provide an insight on which ofthese options is the best. We also propose an architecture fora rack-scale computer built on a printed circuit board (PCB) exploiting the FPGA-based switch.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130435129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CPRring: A Structure-Aware Ring-Based Checkpointing Architecture for FPGA Computing
H. Vu, Shinya Takamaeda-Yamazaki, Takashi Nakada, Y. Nakashima
{"title":"CPRring: A Structure-Aware Ring-Based Checkpointing Architecture for FPGA Computing","authors":"H. Vu, Shinya Takamaeda-Yamazaki, Takashi Nakada, Y. Nakashima","doi":"10.1109/FCCM.2017.60","DOIUrl":"https://doi.org/10.1109/FCCM.2017.60","url":null,"abstract":"In this paper, we present a new architecture forFPGA checkpointing along with an efficient mechanism. Wethen provide a static analysis of original HDL source code toreduce the cost of hardware for checkpointing functionality. Ourevaluations show that with the proposals, checkpointing hardwarecauses small degradation in maximum clock frequency (less than10%). The LUT overhead varies from 14.4% (Dijkstra) to 103.84%(Matrix Multiplication).","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121677072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0