2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM): Latest Publications

Centaur: A Framework for Hybrid CPU-FPGA Databases
Muhsen Owaida, David Sidler, Kaan Kara, G. Alonso
{"title":"Centaur: A Framework for Hybrid CPU-FPGA Databases","authors":"Muhsen Owaida, David Sidler, Kaan Kara, G. Alonso","doi":"10.1109/FCCM.2017.37","DOIUrl":"https://doi.org/10.1109/FCCM.2017.37","url":null,"abstract":"Accelerating relational databases in general and SQL in particular has become an important topic given thechallenges arising from large data collections and increasinglycomplex workloads. Most existing work, however, has beenfocused on either accelerating a single operator (e.g., a join) orin data reduction along the data path (e.g., from disk to CPU). In this paper we focus instead on the system aspects of accelerating a relational engine in hybrid CPU-FPGA architectures. In particular, we present Centaur, a framework running on theFPGA that allows the dynamic allocation of FPGA operatorsto query plans, pipelining these operators among themselveswhen needed, and the hybrid execution of operator pipelinesrunning on the CPU and the FPGA. Centaur is fully compatiblewith relational engines as we demonstrate through its seamlessintegration with MonetDB, a popular column store database. Inthe paper, we describe how this integration is achieved, andempirically demonstrate the advantages of such an approach. The main contribution of the paper is to provide a realisticsolution for accelerating SQL that is compatible with existingdatabase architectures, thereby opening up the possibilities forfurther exploration of FPGA based data processing.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128631531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 71
A High-Level Synthesis Approach Optimizing Accumulations in Floating-Point Programs Using Custom Formats and Operators
Yohann Uguen, F. D. Dinechin, Steven Derrien
{"title":"A High-Level Synthesis Approach Optimizing Accumulations in Floating-Point Programs Using Custom Formats and Operators","authors":"Yohann Uguen, F. D. Dinechin, Steven Derrien","doi":"10.1109/FCCM.2017.41","DOIUrl":"https://doi.org/10.1109/FCCM.2017.41","url":null,"abstract":"Many case studies have demonstrated the potential of Field-Programmable Gate Arrays (FPGAs) as accelerators for a wide range of applications. FPGAs offer massive parallelism and programmability at the bit level. This enables programmers to exploit a range of techniques that avoid many bottlenecks of classical von Neumann computing. However, development costs for FPGAs are orders of magnitude higher than classical programming. A solution would be the use of High-Level Synthesis (HLS) tools, which use C as a hardware description language. However, the C language was designed to be executed on general purpose processors, not to generate hardware. Its datatypes and operators are limited to a small number (more or less matching the hardware operators present in mainstream processors), and HLS tools inherit these limitations. To better exploit the freedom offered by hardware and FPGAs, HLS vendors have enriched the C language with integer and fixed-point types of arbitrary size. Still, the operations on these types remain limited to the basic arithmetic and logic ones. In floating point, the current situation is even worse. The operator set is limited, and the sizes are restricted to 32 and 64 bits. Besides, most recent compilers, including the HLS ones, attempt to follow established standards, in particular C11 and IEEE-754. This ensures bit-exact compatibility with software, but greatly reduces the freedom of optimization by the compiler. For instance, a floating point addition is not associative even though its real equivalent is. In the present work we attempt to give the compiler more freedom. For this, we sacrifice the strict respect of the IEEE-754 and C11 standards, but we replace it with the strict respect of a high-level accuracy specification expressed by the programmer through a pragma. The case study in this work is a program transformation that applies to floating-point additions on a loop's critical path. It decomposes them into elementary steps, resizes the corresponding subcomponents to guarantee some user-specified accuracy, and merges and reorders these components to improve performance. The result of this complex sequence of optimizations could not be obtained from an operator generator, since it involves global loop information. For this purpose, we used a compilation flow involving one or several source-to-source transformations operating on the code given to HLS tools (Figure 1).The proposed transformation already works very well on 3 of the 10 FPMarks where it improves both latency and accuracy by an order of magnitude for comparable area. For 2 more benchmarks, the latency is not improved (but not degraded either) due to current limitations of HLS tools. This defines short-term future work. The main result of this work is that HLS tools also have the potential to generate efficient designs for handling floating-point computations in a completely non-standard way. 
In the longer term, we believe that HLS flows can not only import","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"668 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115750869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
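The abstract's key premise is that IEEE-754 floating-point addition is not associative, so a standards-faithful compiler may not reorder an accumulation. The self-contained snippet below (arbitrary values, plain C++) demonstrates that premise; it is not the paper's transformation, which instead rebuilds the accumulation around a custom-sized accumulator under a user-supplied accuracy pragma.

```cpp
// Minimal demonstration that float addition is order-dependent.
#include <cstdio>

int main() {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float left  = (a + b) + c;   // the large terms cancel first, leaving 1
    float right = a + (b + c);   // c is absorbed into b (below one ulp), leaving 0
    std::printf("(a+b)+c = %g, a+(b+c) = %g\n", left, right);
    // Because the two orderings give different results, a compiler bound to
    // bit-exact IEEE-754/C11 behaviour cannot restructure the sum, which is
    // exactly the freedom the paper buys back via an accuracy specification.
    return 0;
}
```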
Citations: 2
A Nanosecond-Level Hybrid Table Design for Financial Market Data Generators
H. Fu, Conghui He, W. Luk, Weijia Li, Guangwen Yang
{"title":"A Nanosecond–Level Hybrid Table Design for Financial Market Data Generators","authors":"H. Fu, Conghui He, W. Luk, Weijia Li, Guangwen Yang","doi":"10.1109/FCCM.2017.30","DOIUrl":"https://doi.org/10.1109/FCCM.2017.30","url":null,"abstract":"This paper proposes a hybrid sorted table design for minimizing electronic trading latency, with three main contributions. First, a hierarchical sorted table with two levels, a fast cache table in reconfigurable hardware storing megabytes of data items and a master table in software storing gigabytes of data items. Second, a full set of operations, including insertion, deletion, selection and sorting, for the hybrid table with latency in a few cycles. Third, an on-demand synchronization scheme between the cache table and the master table. An implementation has been developed that targets an FPGA-based network card in the environment of the China Financial Futures Exchange (CFFEX) which sustains 1-10Gb/s bandwidth with latency of 400 to 700 nanoseconds, providing an 80- to 125-fold latency reduction compared to a fully optimized CPU-based solution, and a 2.2-fold reduction over an existing FPGA-based solution.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133919522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Applying the Flask Security Architecture to Secure SoC Design
Festus Hategekimana, C. Bobda
{"title":"Applying the Flask Security Architecture to Secure SoC Design","authors":"Festus Hategekimana, C. Bobda","doi":"10.1109/FCCM.2017.28","DOIUrl":"https://doi.org/10.1109/FCCM.2017.28","url":null,"abstract":"We explore a reference monitor (RM) design which borrows from the Flask security architecture. Our RM design goal is to achieve complete mediation by checking and verifying the authority and authenticity of every access to every system object. Access decisions are administered by a security logic server implemented as an extension of the peripheral bus. Initial results show a minimal increase in resource overhead and no significant impact on the performance.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125538406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
An FPGA Design Framework for CNN Sparsification and Acceleration
Sicheng Li, W. Wen, Yu Wang, Song Han, Yiran Chen, Hai Helen Li
{"title":"An FPGA Design Framework for CNN Sparsification and Acceleration","authors":"Sicheng Li, W. Wen, Yu Wang, Song Han, Yiran Chen, Hai Helen Li","doi":"10.1109/FCCM.2017.21","DOIUrl":"https://doi.org/10.1109/FCCM.2017.21","url":null,"abstract":"Convolutional neural networks (CNNs) have recently broken many performance records in image recognition and object detection problems. The success of CNNs, to a great extent, is enabled by the fast scaling-up of the networks that learn from a huge volume of data. The deployment of big CNN models can be both computation-intensive and memory-intensive, leaving severe challenges to hardware implementations. In recent years, sparsification techniques that prune redundant connections in the networks while still retaining the similar accuracy emerge as promising solutions to alliterate the computation overheads associated with CNNs [1]. However, imposing sparsity in CNNs usually generates random network connections and thus, the irregular data access pattern results in poor data locality. The low computation efficiency of the sparse networks, which is caused by the incurred unbalance in computing resource consumption and low memory bandwidth usage, significantly offsets the theocratical reduction of the computation complexity and limits the execution scalability of CNNs on general- purpose architectures [2]. For instance, as an important computation kernel in CNNs – the sparse convoluation, is usually accelerated by using data compression schemes where only nonzero elements of the kernel weights are stored and sent to multiplication-accumulation computations (MACs) at runtime. However, the relevant executions on CPUs and GPUs reach only 0.1% to 10% of the system peak performance even designated software libraries are applied (e.g., MKL library for CPUs and cuSPARSE library for GPUs). Field programmable gate arrays (FPGAs) have been also extensively studied as an important hardware platform for CNN computations [3]. Different from general-purpose architectures, FPGA allows users to customize the functions and organization of the designed hardware in order to adapt various resource needs and data usage patterns. This characteristic, as we identified in this work, can be leveraged to effectively overcome the main challenges in the execution of sparse CNNs through close coordinations between software and hardware. In particular, the reconfigurability of FPGA helps to 1) better map the sparse CNN onto the hardware for improving computation parallelism and execution efficiency and 2) eliminate the computation cost associated with zero weights and enhance data reuse to alleviate the adverse impacts of the irregular data accesses. In this work, we propose a hardware-software co-design framework to address the above challenges in sparse CNN accelerations. First, we introduce a data locality-aware sparsification scheme that optimizes the structure of the sparse CNN during training phase to make it friendly for hardware mapping. Both memory allocation and data access regularization are considered in the optimization process. 
Second, we develop a distributed architecture composed of the customized processing elements (PEs) that enables high computation parallelism","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116853219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
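The compressed-weight MAC pattern mentioned above (store only the nonzero weights plus their indices, multiply them against the matching activations) can be sketched in a few lines of plain C++. The data and the value/index layout are invented; the paper's contribution is the locality-aware sparsification and the distributed PE architecture, not this scalar loop.

```cpp
// Hypothetical sparse MAC: iterate over the nonzero kernel weights only.
#include <cstddef>
#include <cstdio>
#include <vector>

struct SparseKernel {
    std::vector<float> values;   // nonzero weights
    std::vector<int>   indices;  // position of each nonzero weight in the dense kernel
};

float sparse_mac(const SparseKernel& k, const std::vector<float>& activations) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < k.values.size(); ++i)
        acc += k.values[i] * activations[k.indices[i]];  // zero weights never reach the MAC
    return acc;
}

int main() {
    // Dense kernel {0, 0.5, 0, 0, -1.0, 0, 0, 2.0, 0} stored in compressed form:
    SparseKernel k{{0.5f, -1.0f, 2.0f}, {1, 4, 7}};
    std::vector<float> act = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::printf("output = %g\n", sparse_mac(k, act));  // 0.5*2 - 1*5 + 2*8 = 12
    return 0;
}
```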
Citations: 21
Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs
Liqiang Lu, Yun Liang, Qingcheng Xiao, Shengen Yan
{"title":"Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs","authors":"Liqiang Lu, Yun Liang, Qingcheng Xiao, Shengen Yan","doi":"10.1109/FCCM.2017.64","DOIUrl":"https://doi.org/10.1109/FCCM.2017.64","url":null,"abstract":"In recent years, Convolutional Neural Networks (CNNs) have become widely adopted for computer vision tasks. FPGAs have been adequately explored as a promising hardware accelerator for CNNs due to its high performance, energy efficiency, and reconfigurability. However, prior FPGA solutions based on the conventional convolutional algorithm is often bounded by the computational capability of FPGAs (e.g., the number of DSPs). In this paper, we demonstrate that fast Winograd algorithm can dramatically reduce the arithmetic complexity, and improve the performance of CNNs on FPGAs. We first propose a novel architecture for implementing Winograd algorithm on FPGAs. Our design employs line buffer structure to effectively reuse the feature map data among different tiles. We also effectively pipeline the Winograd PE engine and initiate multiple PEs through parallelization. Meanwhile, there exists a complex design space to explore. We propose an analytical model to predict the resource usage and reason about the performance. Then, we use the model to guide a fast design space exploration. Experiments using the state-of-the-art CNNs demonstrate the best performance and energy efficiency on FPGAs. We achieve an average 1006.4 GOP/s for the convolutional layers and 854.6 GOP/s for the overall AlexNet and an average 3044.7 GOP/s for the convolutional layers and 2940.7 GOP/s for the overall VGG16 on Xilinx ZCU102 platform.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123682847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 254
Improved Synthesis of Compressor Trees on FPGAs in High-Level Synthesis
Le Tu, Yuelai Yuan, Kan Huang, Xiaoqiang Zhang, Zixin Wang, Dihu Chen
{"title":"Improved Synthesis of Compressor Trees on FPGAs in High-Level Synthesis","authors":"Le Tu, Yuelai Yuan, Kan Huang, Xiaoqiang Zhang, Zixin Wang, Dihu Chen","doi":"10.1109/FCCM.2017.11","DOIUrl":"https://doi.org/10.1109/FCCM.2017.11","url":null,"abstract":"In this paper, an approach to synthesize compressor trees in High-level Synthesis (HLS) for FPGAs is proposed. Our approach utilizes the bit-level information to improve the compressor tree synthesis. To obtain the bit-level information targeting compressor tree synthesis, a modified bitmask analysis technique based on prior work is proposed. A series of experimental results show that, compared to the existing heuristic, the average reductions of area and delay are 22.96% and 7.05%. The reductions increase to 29.97% and 9.07% respectively, when the carry chains in FPGAs are utilized to implement the compressor trees.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128224555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HLScope: High-Level Performance Debugging for FPGA Designs
Young-kyu Choi, J. Cong
{"title":"HLScope: High-Level Performance Debugging for FPGA Designs","authors":"Young-kyu Choi, J. Cong","doi":"10.1109/FCCM.2017.44","DOIUrl":"https://doi.org/10.1109/FCCM.2017.44","url":null,"abstract":"In their quest for further optimization, field-programmable gate array (FPGA) designers often spend considerable time trying to identify the performance bottleneck in a current design. But since FPGAs do not have built-in high-level probes for performance analysis, manual effort is required to insert custom hardware monitors. This, however, is a time-consuming process which calls for automation. Previous work automates the process of inserting hardware monitors into the communication channels or the finite-state machine, but the instrumentation is applied in low-level hardware description languages (HDL) which limits the comprehensibility in identifying the root cause of stalls. Instead, we propose a performance debugging methodology based on high-level synthesis (HLS). High-level analysis allows tracing the cause of stalls on a function or loop level, which provides a more intuitive feedback that can be used to pinpoint the performance bottleneck. In this paper we propose HLScope, a source-to-source transformation framework based on Vivado HLS for automated performance analysis. We present a method for analyzing the information collected from the software simulation to estimate the stall rate and its cause without the need for FPGA bitstream generation. For detailed analysis, an in-FPGA analysis method is proposed that can be natively integrated into the HLS environment. Experiments show that the parameter extraction from the simulation process is orders of magnitude faster than bitstream generation, with a 2.2% cycle difference on average. In-FPGA flow consumes only about 170 LUTs and a BRAM per monitored module and provides cycle-accurate results.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130658314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Exploration of FPGA-Based Packet Switches for Rack-Scale Computers on a Board
J. H. Han, N. M. Bojan, A. Moore
{"title":"Exploration of FPGA-Based Packet Switches for Rack-Scale Computers on a Board","authors":"J. H. Han, N. M. Bojan, A. Moore","doi":"10.1109/FCCM.2017.35","DOIUrl":"https://doi.org/10.1109/FCCM.2017.35","url":null,"abstract":"This work explores the design space (bandwidthand port configuration) for an FPGA-based top-of-rack switchand, use our implementation, to provide an insight on which ofthese options is the best. We also propose an architecture fora rack-scale computer built on a printed circuit board (PCB) exploiting the FPGA-based switch.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130435129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CPRring: A Structure-Aware Ring-Based Checkpointing Architecture for FPGA Computing
H. Vu, Shinya Takamaeda-Yamazaki, Takashi Nakada, Y. Nakashima
{"title":"CPRring: A Structure-Aware Ring-Based Checkpointing Architecture for FPGA Computing","authors":"H. Vu, Shinya Takamaeda-Yamazaki, Takashi Nakada, Y. Nakashima","doi":"10.1109/FCCM.2017.60","DOIUrl":"https://doi.org/10.1109/FCCM.2017.60","url":null,"abstract":"In this paper, we present a new architecture forFPGA checkpointing along with an efficient mechanism. Wethen provide a static analysis of original HDL source code toreduce the cost of hardware for checkpointing functionality. Ourevaluations show that with the proposals, checkpointing hardwarecauses small degradation in maximum clock frequency (less than10%). The LUT overhead varies from 14.4% (Dijkstra) to 103.84%(Matrix Multiplication).","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121677072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0