2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)最新文献

筛选
英文 中文
High Level Synthesis Based E-Nose System for Gas Applications 基于高级合成的气体电子鼻系统
Amine Ait Si Ali, A. Amira, F. Bensaali, M. Benammar, Muhammad Hassan, A. Bermak
{"title":"High Level Synthesis Based E-Nose System for Gas Applications","authors":"Amine Ait Si Ali, A. Amira, F. Bensaali, M. Benammar, Muhammad Hassan, A. Bermak","doi":"10.1109/FCCM.2016.60","DOIUrl":"https://doi.org/10.1109/FCCM.2016.60","url":null,"abstract":"This paper proposes a hardware/software co-design approach using the Zynq platform for the implementation of an electronic nose (EN) system based on principal component analysis (PCA) as a dimensionality reduction technique and decision tree (DT) as a classification algorithm using a 4x4 in-house fabricated sensor. The system was successfully trained and simulated in MATLAB environment prior to the implementation on the Zynq platform. High level synthesis was carried out on the proposed designs using different optimization directives including loop unrolling, array partitioning and pipelining.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124418444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
High-Throughput and Energy-Efficient Graph Processing on FPGA 基于FPGA的高吞吐量和高能效图形处理
Shijie Zhou, C. Chelmis, V. Prasanna
{"title":"High-Throughput and Energy-Efficient Graph Processing on FPGA","authors":"Shijie Zhou, C. Chelmis, V. Prasanna","doi":"10.1109/FCCM.2016.35","DOIUrl":"https://doi.org/10.1109/FCCM.2016.35","url":null,"abstract":"In this paper, we propose a novel design for large-scale graph processing on FPGA. Our design uses large external memory for storing massive graph data and FPGA for acceleration, and leverages edge-centric computing principles. We propose a data layout which optimizes the external memory performance and leads to an efficient memory activation schedule to reduce on-chip memory power consumption. Further, we develop a parallel architecture on FPGA which can saturate the external memory bandwidth and concurrently process multiple input data to increase throughput. We use our design to accelerate several classic graph algorithms, including single-source shortest path, weakly connected component, and minimum spanning tree. Experimental results show that for all the considered graph algorithms, our design achieves high throughput of over 600 million traversed edges per second (MTEPS) and high energy-efficiency of over 30 MTEPS/W. Compared with a baseline design, our optimizations result in over 3.6× throughput and 5.8× energy-efficiency improvements, respectively. Our design achieves 32% throughput improvement when compared with state-of-the-art FPGA designs, and up to 7.8× speedup when compared with state-of-the-art multi-core implementation.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"155 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116878436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 75
A Content Adapted FPGA Memory Architecture with Pattern Recognition Capability for L1 Track Triggering in the LHC Environment LHC环境下具有L1轨迹触发模式识别能力的FPGA内存结构
T. Harbaum, M. Seboui, M. Balzer, J. Becker, M. Weber
{"title":"A Content Adapted FPGA Memory Architecture with Pattern Recognition Capability for L1 Track Triggering in the LHC Environment","authors":"T. Harbaum, M. Seboui, M. Balzer, J. Becker, M. Weber","doi":"10.1109/FCCM.2016.52","DOIUrl":"https://doi.org/10.1109/FCCM.2016.52","url":null,"abstract":"Modern high-energy physics experiments such as the Compact Muon Solenoid experiment at CERN produce an extraordinary amount of data every 25ns. To handle a data rate of more than 50Tbit/s a multi-level trigger system is required, which reduces the data rate. Due to the increased luminosity after the Phase-II-Upgrade of the LHC, the CMS tracking system has to be redesigned. The current trigger system is unable to handle the resulting amount of data after this upgrade. Because of the latency of a few microseconds the Level 1 Track Trigger has to be implemented in hardware. State-of-the-art pattern recognition filter the incoming data by template matching on ASICs with a content addressable memory architecture. An implementation on an FPGA, which replaces the content addressable memory of the ASIC, has not been possible so far. This paper presents a new approach to a content addressable memory architecture, which allows an implementation of an FPGA based design. By combining filtering and track finding on an FPGA design, there are many possibilities of adjusting the two algorithms to each other. There is more flexibility enabled by the FPGA architecture in contrast to the ASIC. The presented design minimizes the stored data by logic to optimally utilize the available resources of an FPGA. Furthermore, the developed design meets the strong timing constraints and possesses the required properties of the content addressable memory.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125559036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
P4-to-VHDL: Automatic Generation of 100 Gbps Packet Parsers p4到vhdl: 100 Gbps包解析器的自动生成
Pavel Benácek, V. Pus, H. Kubátová
{"title":"P4-to-VHDL: Automatic Generation of 100 Gbps Packet Parsers","authors":"Pavel Benácek, V. Pus, H. Kubátová","doi":"10.1109/FCCM.2016.46","DOIUrl":"https://doi.org/10.1109/FCCM.2016.46","url":null,"abstract":"Software Defined Networking and OpenFlow offer an elegant way to decouple network control plane from data plane. This decoupling has led to great innovation in the control plane, yet the data plane changes come at much slower pace, mainly due to the hard-wired implementation of network switches. The P4 language aims to overcome this obstacle by providing a description of a customized packet processing functionality for configurable switches. That enables a new generation of possibly heterogeneous networking hardware that can be runtime tailored for the needs of particular applications from various domains. In this paper we contribute to the idea of P4 by presenting design, analysis and experimental results of our packet parser generator. The generator converts a parse graph description of P4 to a synthetizable VHDL code suitable for FPGA implementation. Our results show that the generated circuit is able to parse 100 Gbps traffic with fairly complex protocol structure at line rate on a Xilinx Virtex-7 FPGA. The approach can be used not only in switches, but also in other appliances, such as application accelerators and smart NICs. We compare the generated output to a hand-written parser to show that the price for configurability is only a slightly larger and slower circuit.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122825303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 46
Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform CPU-FPGA异构平台上的等价联接加速
Ren Chen, V. Prasanna
{"title":"Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform","authors":"Ren Chen, V. Prasanna","doi":"10.1109/FCCM.2016.62","DOIUrl":"https://doi.org/10.1109/FCCM.2016.62","url":null,"abstract":"Accelerating database applications using FPGAs has recently been an area of growing interest in both academia and industry. Equi-join is one of the key database operations whose performance highly depends on sorting, which exhibits high memory usage on FPGA. A fully pipelined N-key merge sorter consists of log N sorting stages using O(N) memory totally. For large data sets, external memory has to be employed to perform data buffering between the sorting stages. This introduces pipeline stalls as well as several iterations between FPGA and external memory, causing significant performance degradation. In this paper, we speed-up equi-join using a hybrid CPU-FPGA heterogeneous platform. To alleviate the performance impact of limited memory, we propose a merge sort based hybrid design where the first few sorting stages in the merge sort tree are replaced with \"folded\" bitonic sorting networks. These \"folded\" bitonic sorting networks operate in parallel on the FPGA. The partial results are then merged on the CPU to produce the final sorted result. Based on this hybrid sorting design, we develop two streaming join algorithms by optimizing the classic CPU-based nested-loop join and sort-merge join algorithms. On a rangeof data set sizes, our design achieves throughput improvement of 3.1x and 1.9x compared with software-only and FPGA only implementations, respectively. Our design sustains 21.6% of thepeak bandwidth, which is 3.9x utilization obtained by the state-of-the-art FPGA equi-join implementation.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"272 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121988909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 35
Acceleration of the Pair-HMM Algorithm for DNA Variant Calling DNA变异调用的Pair-HMM算法加速
Sitao Huang, G. Manikandan, Anand Ramachandran, K. Rupnow, Wen-mei W. Hwu, Deming Chen
{"title":"Acceleration of the Pair-HMM Algorithm for DNA Variant Calling","authors":"Sitao Huang, G. Manikandan, Anand Ramachandran, K. Rupnow, Wen-mei W. Hwu, Deming Chen","doi":"10.1145/3020078.3021749","DOIUrl":"https://doi.org/10.1145/3020078.3021749","url":null,"abstract":"In this project, we propose an SoC solution to accelerate the Pair-HMM's forward algorithm which is the key performance bottleneck in the GATK's HaplotypeCaller tool for DNA variant calling. We develop two versions of the Pair-HMM accelerator: one using High Level Synthesis (HLS), and another ring-based manual RTL implementation. We investigate the performance of the manual RTL design and HLS design in terms of design flexibility and overall run-time. We achieve a significant speed-up of up to 19x through the HLS implementation and speed-up of up to 95x through the RTL implementation of the algorithm.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124159862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 46
Runtime Parameterizable Regular Expression Operators for Databases 数据库的运行时可参数化正则表达式运算符
Z. István, David Sidler, G. Alonso
{"title":"Runtime Parameterizable Regular Expression Operators for Databases","authors":"Z. István, David Sidler, G. Alonso","doi":"10.1109/FCCM.2016.61","DOIUrl":"https://doi.org/10.1109/FCCM.2016.61","url":null,"abstract":"Relational databases execute user queries through operator trees, where each operator has a well defined interface and a specific task (e.g., arithmetic function, pattern matching, aggregation, etc.). Hardware acceleration of compute intensive operators is a promising prospect but it comes with challenges. Databases execute tens of thousands of different queries per second. Thus, if only one specific instantiation of an operator is supported by the accelerator, it will have little effect on the overall workload. In this paper we explore the tradeoff between resource efficiency and expression complexity for an FPGA accelerator targeting string-matching operators (LIKE and REGEXP_LIKE in SQL). This tradeoff is complex. For instance, the FPGA not always wins: simple queries that can be answered from indexes run faster on the CPU. On complex regular expressions, the FPGA is faster but needs to be parametrized at runtime to be able to support different queries. For very long patterns, the entire expression might not fit into the FPGA circuit and a combined mode CPU-FPGA must be chosen. We evaluate our design on a heterogeneous multi-core machine in which the FPGA has cache coherent access to the CPU memory. In addition to the string matching circuit, we also show how to implement database page parsing logic so as to be able to work directly on the same memory data structures as the database engine.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115854688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 30
Knowledge Transfer in Automatic Optimisation of Reconfigurable Designs 可重构设计自动优化中的知识转移
Maciej Kurek, M. Deisenroth, W. Luk, T. Todman
{"title":"Knowledge Transfer in Automatic Optimisation of Reconfigurable Designs","authors":"Maciej Kurek, M. Deisenroth, W. Luk, T. Todman","doi":"10.1109/FCCM.2016.29","DOIUrl":"https://doi.org/10.1109/FCCM.2016.29","url":null,"abstract":"This paper presents a novel approach for automatic optimisation of reconfigurable design parameters based on knowledge transfer. The key idea is to make use of insights derived from optimising related designs to benefit future optimisations. We show how to use designs targeting one device to speed up optimisation of another device. The proposed approach is evaluated based on various applications including computational finance and seismic imaging. It is capable of achieving up to 35% reduction in optimisation time in producing designs with similar performance, compared to alternative optimisation methods.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133370593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
FPGA-Based Reduction Techniques for Efficient Deep Neural Network Deployment 基于fpga的高效深度神经网络部署约简技术
A. Page, T. Mohsenin
{"title":"FPGA-Based Reduction Techniques for Efficient Deep Neural Network Deployment","authors":"A. Page, T. Mohsenin","doi":"10.1109/FCCM.2016.58","DOIUrl":"https://doi.org/10.1109/FCCM.2016.58","url":null,"abstract":"Deep neural networks have been shown to outperform prior state-of-the-art solutions that often relied heavily on hand-engineered feature extraction techniques coupled with simple classification algorithms. In particular, deep max-pooling convolutional neural networks (MPCNN) have been shown to dominate on several popular public benchmarks. Unfortunately, the benefits of deep networks have yet to be exploited in embedded, resource-bound settings that have strict power and area budgets. GPUs have been shown to improve throughput and energy-efficiency over CPUs due to their parallel architecture. In a similar fashion, FPGAs can improve performance while allowing more fine control over implementation. In order to meet power, area, and latency constraints, it is necessary to develop network reduction strategies in addition to optimal mapping. This work looks at two specific reduction techniques including limited precision for both fixed-point and floating-point formats, and performing weight matrix truncation using singular value decomposition. An FPGA-based framework is also proposed and used to deploy the trained networks. To demonstrate, a handful of public computer vision datasets including MNIST, CIFAR-10, and SVHN are fully implemented on a low-power Xilinx Artix-7 FPGA. Experimental results show that all networks are able to achieve a classification throughput of 16 img/sec and consume less than 700 mW when running at 200 MHz. In addition, the reduced networks are able to, on average, reduce power and area utilization by 37% and 44%, respectively, while only incurring less than 0.20% decrease in accuracy.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"658 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132093276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
Parallelizing FPGA Technology Mapping through Partitioning 通过分区并行化FPGA技术映射
Chuyu Shen, Zili Lin, Ping Fan, X. Meng, Weikang Qian
{"title":"Parallelizing FPGA Technology Mapping through Partitioning","authors":"Chuyu Shen, Zili Lin, Ping Fan, X. Meng, Weikang Qian","doi":"10.1109/FCCM.2016.48","DOIUrl":"https://doi.org/10.1109/FCCM.2016.48","url":null,"abstract":"The traditional FPGA technology mapping flow is very time-consuming as modern FPGA designs become larger. To speed up this procedure, this paper proposes a new approach based on circuit partition to parallelize it. The idea is to split the original circuit into several sub-circuits and assign each one to a core of a multi-core processor for simultaneous technology mapping. Compared to other existing parallelization methods, our method has the benefit of being independent of the detailed mapping algorithm. Our proposed partition method is able to minimize the quality loss caused by the partitioning. We have successfully integrated the proposed approach into an industrial FPGA mapping platform. The proposed flow gains a speed-up of 1.6X on average on a quad-core processor with negligible influence on the LUT count and the critical path length.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"168 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115972941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信