2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)最新文献

筛选
英文 中文
Sorting Large Data Sets with FPGA-Accelerated Samplesort 用fpga加速采样排序大型数据集
Han Chen, S. Madaminov, M. Ferdman, Peter Milder
{"title":"Sorting Large Data Sets with FPGA-Accelerated Samplesort","authors":"Han Chen, S. Madaminov, M. Ferdman, Peter Milder","doi":"10.1109/FCCM.2019.00067","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00067","url":null,"abstract":"Sorting is a fundamental operation in many applications such as databases, search, and social networks. Although FPGAs have been shown effective at sorting data sizes that fit on chip, systems that sort larger data sets by shuffling data on and off chip are typically bottlenecked by costly merge operations or data transfer time. We propose a new approach to sorting large data sets by accelerating the samplesort algorithm using a server with a PCIe-connected FPGA. Samplesort works by randomly sampling to determine how to partition data into approximately equal-sized non-overlapping \"buckets,\" sorting each bucket, and concatenating the results. Although samplesort can partition a large problem into smaller ones that fit in the FPGA's on-chip memory, partitioning in software is slow. Our system uses a novel parallel hardware partitioner that is only limited in data set size by available FPGA hardware resources. After partitioning, each bucket is sorted using parallel sorting hardware. The CPU is responsible for sampling data, cleaning up any potential problems caused by variation in bucket size, and providing scalability by performing an initial coarse-grained partitioning when the input set is larger than the FPGA can sort. We prototype our design using Amazon Web Services FPGA instances, which pair a Xilinx Virtex UltraScale+ FPGA with a high-performance server. Our experiments demonstrate a 17.1x speedup over GNU parallel sort when sorting 2^23 key-value records and a speedup of 4.2x when sorting 2^30 records.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121955760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
Improved Techniques for Sensing Intra-Device Side Channel Leakage 器件内侧通道泄漏检测的改进技术
William Hunter, Christopher McCarty, L. Lerner
{"title":"Improved Techniques for Sensing Intra-Device Side Channel Leakage","authors":"William Hunter, Christopher McCarty, L. Lerner","doi":"10.1109/FCCM.2019.00069","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00069","url":null,"abstract":"Side channels which introduce intra-device circuit module information leakage or functional influence are of concern for the security and trust of many applications, such as multi-tenant and multi-level security single FPGA designs. Previous works utilized a sensor co-located on the same FPGA with a target module which was able to detect side channel voltage variations. We build on this by creating a sensor with more programmability and sensitivity resulting in improved recovery of bit patterns from an isolated target. We demonstrate for the first time the recovery of an unknown target frequency and data pattern length in a multi-user FPGA side channel attack. We also show increased sensitivity over previously developed voltage sensors enabling data recovery with fewer samples.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127591151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
OpenCL Kernel Vectorization on the CPU, GPU, and FPGA: A Case Study with Frequent Pattern Compression 在CPU, GPU和FPGA上的OpenCL内核矢量化:频繁模式压缩的案例研究
Zheming Jin, H. Finkel
{"title":"OpenCL Kernel Vectorization on the CPU, GPU, and FPGA: A Case Study with Frequent Pattern Compression","authors":"Zheming Jin, H. Finkel","doi":"10.1109/FCCM.2019.00071","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00071","url":null,"abstract":"OpenCL promotes code portability, and natively supports vectorized data types, which allows developers to potentially take advantage of the single-instruction-multiple-data instructions on CPUs, GPUs, and FPGAs. FPGAs are becoming a promising heterogeneous computing component. In our study, we choose a kernel used in frequent pattern compression as a case study of OpenCL kernel vectorizations on the three computing platforms. We describe different pattern matching approaches for the kernel, and manually vectorize the OpenCL kernel by a factor ranging from 2 to 16. We evaluate the kernel on an Intel Xeon 16-core CPU, an NVIDIA P100 GPU, and a Nallatech 385A FPGA card featuring an Intel Arria 10 GX1150 FPGA. Compared to the optimized kernel that is not vectorized, our vectorization can improve the kernel performance by a factor of 16 on the FPGA. The performance improvement ranges from 1 to 11.4 on the CPU, and from 1.02 to 9.3 on the GPU. The effectiveness of kernel vectorization depends on the work-group size.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129701274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Design Patterns for Code Reuse in HLS Packet Processing Pipelines HLS包处理管道中代码重用的设计模式
Haggai Eran, Lior Zeno, Z. István, M. Silberstein
{"title":"Design Patterns for Code Reuse in HLS Packet Processing Pipelines","authors":"Haggai Eran, Lior Zeno, Z. István, M. Silberstein","doi":"10.1109/FCCM.2019.00036","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00036","url":null,"abstract":"High-level synthesis (HLS) allows developers to be more productive in designing FPGA circuits thanks to familiar programming languages and high-level abstractions. In order to create high-performance circuits, HLS tools, such as Xilinx Vivado HLS, require following specific design patterns and techniques. Unfortunately, when applied to network packet processing tasks, these techniques limit code reuse and modularity, requiring developers to use deprecated programming conventions. We propose a methodology for developing high-speed networking applications using Vivado HLS for C++, focusing on reusability, code simplicity, and overall performance. Following this methodology, we implement a class library (ntl) with several building blocks that can be used in a wide spectrum of networking applications. We evaluate the methodology by implementing two applications: a UDP stateless firewall and a key-value store cache designed for FPGA-based SmartNICs, both processing packets at 40Gbps line-rate.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"167 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129755530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
[Publisher's information] (发布者的信息)
{"title":"[Publisher's information]","authors":"","doi":"10.1109/fccm.2019.00081","DOIUrl":"https://doi.org/10.1109/fccm.2019.00081","url":null,"abstract":"","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124414136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Active Stereo Vision with High Resolution on an FPGA 基于FPGA的高分辨率主动立体视觉
Marc Pfeifer, P. Scholl, R. Voigt, B. Becker
{"title":"Active Stereo Vision with High Resolution on an FPGA","authors":"Marc Pfeifer, P. Scholl, R. Voigt, B. Becker","doi":"10.1109/FCCM.2019.00026","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00026","url":null,"abstract":"We present a novel FPGA based active stereo vision system, tailored for the use in a mobile 3D stereo camera. For the generation of a single 3D map the matching algorithm is based on a correlation approach, where multiple stereo image pairs instead of a single one are processed to guarantee an improved depth resolution. To efficiently handle the large amounts of incoming image data we adapt the algorithm to the underlying FPGA structures, e.g. by making use of pipelining and parallelization.Experiments demonstrate that our approach provides high-quality 3D maps at least three times more energy-efficient (5.5 fps/W) than comparable approaches executed on CPU and GPU platforms. Implemented on a Xilinx Zynq-7030 SoC our system provides a computation speed of 12.2 fps, at a resolution of 1.3 megapixel and a 128 pixel disparity search space. As such it outperforms the currently best passive stereo systems of the Middlebury Stereo Evaluation in terms of speed and accuracy. The presented approach is therefore well suited for mobile applications, that require a highly accurate and energy-efficient active stereo vision system.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129418280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Hardware Acceleration of Long Read Pairwise Overlapping in Genome Sequencing: A Race Between FPGA and GPU 基因组测序中长读对重叠的硬件加速:FPGA和GPU之间的竞争
Licheng Guo, Jason Lau, Zhenyuan Ruan, Peng Wei, J. Cong
{"title":"Hardware Acceleration of Long Read Pairwise Overlapping in Genome Sequencing: A Race Between FPGA and GPU","authors":"Licheng Guo, Jason Lau, Zhenyuan Ruan, Peng Wei, J. Cong","doi":"10.1109/FCCM.2019.00027","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00027","url":null,"abstract":"In genome sequencing, it is a crucial but time-consuming task to detect potential overlaps between any pair of the input reads, especially those that are ultra-long. The state-of-the-art overlapping tool Minimap2 outperforms other popular tools in speed and accuracy. It has a single computing hot-spot, chaining, that takes 70% of the time and needs to be accelerated. There are several crucial issues for hardware acceleration because of the nature of chaining. First, the original computation pattern is poorly parallelizable and a direct implementation will result in low utilization of parallel processing units. We propose a method to reorder the operation sequence that transforms the algorithm into a hardware-friendly form. Second, the large but variable sizes of input data make it hard to leverage task-level parallelism. Therefore, we customize a fine-grained task dispatching scheme which could keep parallel PEs busy while satisfying the on-chip memory restriction. Based on these optimizations, we map the algorithm to a fully pipelined streaming architecture on FPGA using HLS, which achieves significant performance improvement. The principles of our acceleration design apply to both FPGA and GPU. Compared to the multi-threading CPU baseline, our GPU accelerator achieves 7x acceleration, while our FPGA accelerator achieves 28x acceleration. We further conduct an architecture study to quantitatively analyze the architectural reason for the performance difference. The summarized insights could serve as a guide on choosing the proper hardware acceleration platform.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"41 13","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113936078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 55
Yosys+nextpnr: An Open Source Framework from Verilog to Bitstream for Commercial FPGAs Yosys+ nextpr:从Verilog到Bitstream的商业fpga开源框架
David Shah, Eddie Hung, C. Wolf, Serge Bazanski, D. Gisselquist, Miodrag Milanovic
{"title":"Yosys+nextpnr: An Open Source Framework from Verilog to Bitstream for Commercial FPGAs","authors":"David Shah, Eddie Hung, C. Wolf, Serge Bazanski, D. Gisselquist, Miodrag Milanovic","doi":"10.1109/FCCM.2019.00010","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00010","url":null,"abstract":"This paper introduces a fully free and open source software (FOSS) architecture-neutral FPGA framework comprising of Yosys for Verilog synthesis, and nextpnr for placement, routing, and bitstream generation. Currently, this flow supports two commercially available FPGA families, Lattice iCE40 (up to 8K logic elements) and Lattice ECP5 (up to 85K elements) and has been hardware-proven for custom-computing machines including a low-power neural-network accelerator and an OpenRISC system-on-chip capable of booting Linux. Both Yosys and nextpnr have been engineered in a highly flexible manner to support many of the features present in modern FPGAs by separating architecture-specific details from the common mapping algorithms.This framework is demonstrated on a longest-path case study to find an atypical single source-sink path occupying up to 45% of all on-chip wiring.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123243337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 61
Cost-Effective Energy Monitoring of a Zynq-Based Real-Time System Including Dual Gigabit Ethernet 基于zynq的双千兆以太网实时系统的高性价比能源监测
M. Geier, Dominik Faller, Marian Brändle, S. Chakraborty
{"title":"Cost-Effective Energy Monitoring of a Zynq-Based Real-Time System Including Dual Gigabit Ethernet","authors":"M. Geier, Dominik Faller, Marian Brändle, S. Chakraborty","doi":"10.1109/FCCM.2019.00068","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00068","url":null,"abstract":"Recent FPGA architectures integrate various power management features already established in CPU-driven SoCs to reach more energy-sensitive application domains such as, e.g., automotive and robotics. This also qualifies hybrid Programmable SoCs (pSoCs) that combine fixed-function SoCs with configurable FPGA fabric for heterogeneous Real-time Systems (RTSs), which operate under predefined latency and power constraints in safety-critical environments. Their complex application-specific computation and communication (incl. I/O) architectures result in highly varying power consumption, which requires precise voltage and current sensing on all relevant supply rails to enable dependable evaluation of available and novel power management techniques. In this paper, we propose a low-cost 18-channel 16-bit-resolution measurement system capable of over 200 kSPS (kilo-samples per second) for instrumentation of current pSoC development boards. In addition, we propose to include crucial I/O components such as Ethernet PHYs into the power monitoring to gain a holistic view on the RTS's temporal behavior covering not only computation on FPGA and CPUs, but also communication in terms of, e.g., reception of sensor values and transmission of actuation signals. We present an FMC-sized implementation of our measurement system combined with two Gigabit Ethernet PHYs and one HDMI input. Paired with Xilinx' ZC702 development board, we are able to synchronously acquire power traces of a Zynq pSoC and the two PHYs precise enough to identify individual Ethernet frames.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124517171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
Module-per-Object: A Human-Driven Methodology for C++-Based High-Level Synthesis Design 面向对象的模块:基于c++的高级综合设计的人为驱动方法
Jeferson Santiago da Silva, F. Boyer, J. Langlois
{"title":"Module-per-Object: A Human-Driven Methodology for C++-Based High-Level Synthesis Design","authors":"Jeferson Santiago da Silva, F. Boyer, J. Langlois","doi":"10.1109/FCCM.2019.00037","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00037","url":null,"abstract":"High-Level Synthesis (HLS) brings FPGAs to audiences previously unfamiliar to hardware design. However, achieving the highest Quality-of-Results (QoR) with HLS is still unattainable for most programmers. This requires detailed knowledge of FPGA architecture and hardware design in order to produce FPGA-friendly codes. Moreover, these codes are normally in conflict with best coding practices, which favor code reuse, modularity, and conciseness. To overcome these limitations, we propose Module-per-Object (MpO), a human-driven HLS design methodology intended for both hardware designers and software developers with limited FPGA expertise. MpO exploits modern C++ to raise the abstraction level while improving QoR, code readability and modularity. To guide HLS designers, we present the five characteristics of MpO classes. Each characteristic exploits the power of HLS-supported modern C++ features to build C++-based hardware modules. These characteristics lead to high-quality software descriptions and efficient hardware generation. We also present a use case of MpO, where we use C++ as the intermediate language for FPGA-targeted code generation from P4, a packet processing domain specific language. The MpO methodology is evaluated using three design experiments: a packet parser, a flow-based traffic manager, and a digital up-converter. Based on experiments, we show that MpO can be comparable to handwritten VHDL code while keeping a high abstraction level, humanreadable coding style and modularity. Compared to traditional C-based HLS design, MpO leads to more efficient circuit generation, both in terms of performance and resource utilization. Also, the MpO approach notably improves software quality, augmenting parameterization while eliminating the incidence of code duplication.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"09 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134523069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信