2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)最新文献

High Level Synthesis Based E-Nose System for Gas Applications 基于高级合成的气体电子鼻系统

2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2016-05-01 DOI: 10.1109/FCCM.2016.60

Amine Ait Si Ali, A. Amira, F. Bensaali, M. Benammar, Muhammad Hassan, A. Bermak

引用次数: 1

High-Throughput and Energy-Efficient Graph Processing on FPGA 基于FPGA的高吞吐量和高能效图形处理

2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2016-05-01 DOI: 10.1109/FCCM.2016.35

Shijie Zhou, C. Chelmis, V. Prasanna

{"title":"High-Throughput and Energy-Efficient Graph Processing on FPGA","authors":"Shijie Zhou, C. Chelmis, V. Prasanna","doi":"10.1109/FCCM.2016.35","DOIUrl":"https://doi.org/10.1109/FCCM.2016.35","url":null,"abstract":"In this paper, we propose a novel design for large-scale graph processing on FPGA. Our design uses large external memory for storing massive graph data and FPGA for acceleration, and leverages edge-centric computing principles. We propose a data layout which optimizes the external memory performance and leads to an efficient memory activation schedule to reduce on-chip memory power consumption. Further, we develop a parallel architecture on FPGA which can saturate the external memory bandwidth and concurrently process multiple input data to increase throughput. We use our design to accelerate several classic graph algorithms, including single-source shortest path, weakly connected component, and minimum spanning tree. Experimental results show that for all the considered graph algorithms, our design achieves high throughput of over 600 million traversed edges per second (MTEPS) and high energy-efficiency of over 30 MTEPS/W. Compared with a baseline design, our optimizations result in over 3.6× throughput and 5.8× energy-efficiency improvements, respectively. Our design achieves 32% throughput improvement when compared with state-of-the-art FPGA designs, and up to 7.8× speedup when compared with state-of-the-art multi-core implementation.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"155 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116878436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 75

A Content Adapted FPGA Memory Architecture with Pattern Recognition Capability for L1 Track Triggering in the LHC Environment LHC环境下具有L1轨迹触发模式识别能力的FPGA内存结构

2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2016-05-01 DOI: 10.1109/FCCM.2016.52

T. Harbaum, M. Seboui, M. Balzer, J. Becker, M. Weber

{"title":"A Content Adapted FPGA Memory Architecture with Pattern Recognition Capability for L1 Track Triggering in the LHC Environment","authors":"T. Harbaum, M. Seboui, M. Balzer, J. Becker, M. Weber","doi":"10.1109/FCCM.2016.52","DOIUrl":"https://doi.org/10.1109/FCCM.2016.52","url":null,"abstract":"Modern high-energy physics experiments such as the Compact Muon Solenoid experiment at CERN produce an extraordinary amount of data every 25ns. To handle a data rate of more than 50Tbit/s a multi-level trigger system is required, which reduces the data rate. Due to the increased luminosity after the Phase-II-Upgrade of the LHC, the CMS tracking system has to be redesigned. The current trigger system is unable to handle the resulting amount of data after this upgrade. Because of the latency of a few microseconds the Level 1 Track Trigger has to be implemented in hardware. State-of-the-art pattern recognition filter the incoming data by template matching on ASICs with a content addressable memory architecture. An implementation on an FPGA, which replaces the content addressable memory of the ASIC, has not been possible so far. This paper presents a new approach to a content addressable memory architecture, which allows an implementation of an FPGA based design. By combining filtering and track finding on an FPGA design, there are many possibilities of adjusting the two algorithms to each other. There is more flexibility enabled by the FPGA architecture in contrast to the ASIC. The presented design minimizes the stored data by logic to optimally utilize the available resources of an FPGA. Furthermore, the developed design meets the strong timing constraints and possesses the required properties of the content addressable memory.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125559036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 16

P4-to-VHDL: Automatic Generation of 100 Gbps Packet Parsers p4到vhdl: 100 Gbps包解析器的自动生成

2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2016-05-01 DOI: 10.1109/FCCM.2016.46

Pavel Benácek, V. Pus, H. Kubátová

{"title":"P4-to-VHDL: Automatic Generation of 100 Gbps Packet Parsers","authors":"Pavel Benácek, V. Pus, H. Kubátová","doi":"10.1109/FCCM.2016.46","DOIUrl":"https://doi.org/10.1109/FCCM.2016.46","url":null,"abstract":"Software Defined Networking and OpenFlow offer an elegant way to decouple network control plane from data plane. This decoupling has led to great innovation in the control plane, yet the data plane changes come at much slower pace, mainly due to the hard-wired implementation of network switches. The P4 language aims to overcome this obstacle by providing a description of a customized packet processing functionality for configurable switches. That enables a new generation of possibly heterogeneous networking hardware that can be runtime tailored for the needs of particular applications from various domains. In this paper we contribute to the idea of P4 by presenting design, analysis and experimental results of our packet parser generator. The generator converts a parse graph description of P4 to a synthetizable VHDL code suitable for FPGA implementation. Our results show that the generated circuit is able to parse 100 Gbps traffic with fairly complex protocol structure at line rate on a Xilinx Virtex-7 FPGA. The approach can be used not only in switches, but also in other appliances, such as application accelerators and smart NICs. We compare the generated output to a hand-written parser to show that the price for configurability is only a slightly larger and slower circuit.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122825303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 46

Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform CPU-FPGA异构平台上的等价联接加速

2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2016-05-01 DOI: 10.1109/FCCM.2016.62

Ren Chen, V. Prasanna

{"title":"Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform","authors":"Ren Chen, V. Prasanna","doi":"10.1109/FCCM.2016.62","DOIUrl":"https://doi.org/10.1109/FCCM.2016.62","url":null,"abstract":"Accelerating database applications using FPGAs has recently been an area of growing interest in both academia and industry. Equi-join is one of the key database operations whose performance highly depends on sorting, which exhibits high memory usage on FPGA. A fully pipelined N-key merge sorter consists of log N sorting stages using O(N) memory totally. For large data sets, external memory has to be employed to perform data buffering between the sorting stages. This introduces pipeline stalls as well as several iterations between FPGA and external memory, causing significant performance degradation. In this paper, we speed-up equi-join using a hybrid CPU-FPGA heterogeneous platform. To alleviate the performance impact of limited memory, we propose a merge sort based hybrid design where the first few sorting stages in the merge sort tree are replaced with \"folded\" bitonic sorting networks. These \"folded\" bitonic sorting networks operate in parallel on the FPGA. The partial results are then merged on the CPU to produce the final sorted result. Based on this hybrid sorting design, we develop two streaming join algorithms by optimizing the classic CPU-based nested-loop join and sort-merge join algorithms. On a rangeof data set sizes, our design achieves throughput improvement of 3.1x and 1.9x compared with software-only and FPGA only implementations, respectively. Our design sustains 21.6% of thepeak bandwidth, which is 3.9x utilization obtained by the state-of-the-art FPGA equi-join implementation.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"272 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121988909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 35

Acceleration of the Pair-HMM Algorithm for DNA Variant Calling DNA变异调用的Pair-HMM算法加速

2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2016-05-01 DOI: 10.1145/3020078.3021749

Sitao Huang, G. Manikandan, Anand Ramachandran, K. Rupnow, Wen-mei W. Hwu, Deming Chen

引用次数: 46

Runtime Parameterizable Regular Expression Operators for Databases 数据库的运行时可参数化正则表达式运算符

2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2016-05-01 DOI: 10.1109/FCCM.2016.61

Z. István, David Sidler, G. Alonso

{"title":"Runtime Parameterizable Regular Expression Operators for Databases","authors":"Z. István, David Sidler, G. Alonso","doi":"10.1109/FCCM.2016.61","DOIUrl":"https://doi.org/10.1109/FCCM.2016.61","url":null,"abstract":"Relational databases execute user queries through operator trees, where each operator has a well defined interface and a specific task (e.g., arithmetic function, pattern matching, aggregation, etc.). Hardware acceleration of compute intensive operators is a promising prospect but it comes with challenges. Databases execute tens of thousands of different queries per second. Thus, if only one specific instantiation of an operator is supported by the accelerator, it will have little effect on the overall workload. In this paper we explore the tradeoff between resource efficiency and expression complexity for an FPGA accelerator targeting string-matching operators (LIKE and REGEXP_LIKE in SQL). This tradeoff is complex. For instance, the FPGA not always wins: simple queries that can be answered from indexes run faster on the CPU. On complex regular expressions, the FPGA is faster but needs to be parametrized at runtime to be able to support different queries. For very long patterns, the entire expression might not fit into the FPGA circuit and a combined mode CPU-FPGA must be chosen. We evaluate our design on a heterogeneous multi-core machine in which the FPGA has cache coherent access to the CPU memory. In addition to the string matching circuit, we also show how to implement database page parsing logic so as to be able to work directly on the same memory data structures as the database engine.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115854688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 30

Knowledge Transfer in Automatic Optimisation of Reconfigurable Designs 可重构设计自动优化中的知识转移

2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2016-05-01 DOI: 10.1109/FCCM.2016.29

Maciej Kurek, M. Deisenroth, W. Luk, T. Todman

引用次数: 9

FPGA-Based Reduction Techniques for Efficient Deep Neural Network Deployment 基于fpga的高效深度神经网络部署约简技术

2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2016-05-01 DOI: 10.1109/FCCM.2016.58

A. Page, T. Mohsenin

{"title":"FPGA-Based Reduction Techniques for Efficient Deep Neural Network Deployment","authors":"A. Page, T. Mohsenin","doi":"10.1109/FCCM.2016.58","DOIUrl":"https://doi.org/10.1109/FCCM.2016.58","url":null,"abstract":"Deep neural networks have been shown to outperform prior state-of-the-art solutions that often relied heavily on hand-engineered feature extraction techniques coupled with simple classification algorithms. In particular, deep max-pooling convolutional neural networks (MPCNN) have been shown to dominate on several popular public benchmarks. Unfortunately, the benefits of deep networks have yet to be exploited in embedded, resource-bound settings that have strict power and area budgets. GPUs have been shown to improve throughput and energy-efficiency over CPUs due to their parallel architecture. In a similar fashion, FPGAs can improve performance while allowing more fine control over implementation. In order to meet power, area, and latency constraints, it is necessary to develop network reduction strategies in addition to optimal mapping. This work looks at two specific reduction techniques including limited precision for both fixed-point and floating-point formats, and performing weight matrix truncation using singular value decomposition. An FPGA-based framework is also proposed and used to deploy the trained networks. To demonstrate, a handful of public computer vision datasets including MNIST, CIFAR-10, and SVHN are fully implemented on a low-power Xilinx Artix-7 FPGA. Experimental results show that all networks are able to achieve a classification throughput of 16 img/sec and consume less than 700 mW when running at 200 MHz. In addition, the reduced networks are able to, on average, reduce power and area utilization by 37% and 44%, respectively, while only incurring less than 0.20% decrease in accuracy.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"658 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132093276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 8

Parallelizing FPGA Technology Mapping through Partitioning 通过分区并行化FPGA技术映射

2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2016-05-01 DOI: 10.1109/FCCM.2016.48

Chuyu Shen, Zili Lin, Ping Fan, X. Meng, Weikang Qian

引用次数: 4