2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM): Latest Papers (Page 7)

Sorting Large Data Sets with FPGA-Accelerated Samplesort

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2019-04-01 DOI: 10.1109/FCCM.2019.00067

Han Chen, S. Madaminov, M. Ferdman, Peter Milder

Abstract: Sorting is a fundamental operation in many applications such as databases, search, and social networks. Although FPGAs have been shown to be effective at sorting data sizes that fit on chip, systems that sort larger data sets by shuffling data on and off chip are typically bottlenecked by costly merge operations or data transfer time. We propose a new approach to sorting large data sets by accelerating the samplesort algorithm using a server with a PCIe-connected FPGA. Samplesort works by randomly sampling to determine how to partition data into approximately equal-sized non-overlapping "buckets," sorting each bucket, and concatenating the results. Although samplesort can partition a large problem into smaller ones that fit in the FPGA's on-chip memory, partitioning in software is slow. Our system uses a novel parallel hardware partitioner that is only limited in data set size by available FPGA hardware resources. After partitioning, each bucket is sorted using parallel sorting hardware. The CPU is responsible for sampling data, cleaning up any potential problems caused by variation in bucket size, and providing scalability by performing an initial coarse-grained partitioning when the input set is larger than the FPGA can sort. We prototype our design using Amazon Web Services FPGA instances, which pair a Xilinx Virtex UltraScale+ FPGA with a high-performance server. Our experiments demonstrate a 17.1x speedup over GNU parallel sort when sorting 2^23 key-value records and a speedup of 4.2x when sorting 2^30 records.
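The three samplesort steps named in the abstract (random sampling, partitioning into non-overlapping buckets, then per-bucket sorting and concatenation) can be illustrated with a minimal software-only sketch. This is not the authors' hardware design: the function name `samplesort` and the parameters `num_buckets`, `oversample`, and `seed` are hypothetical, chosen only for illustration, and the per-bucket sort simply uses Python's built-in `sorted`.

```python
import bisect
import random


def samplesort(records, num_buckets=4, oversample=8, seed=0):
    """Illustrative samplesort: sample splitters, partition, sort buckets, concatenate.

    All parameter names are assumptions for this sketch, not taken from the paper.
    """
    sample_size = num_buckets * oversample
    if len(records) <= sample_size:
        return sorted(records)  # input too small to be worth partitioning

    rng = random.Random(seed)
    # Step 1: randomly sample the input and take evenly spaced splitters
    # from the sorted sample, so buckets end up approximately equal in size.
    sample = sorted(rng.sample(records, sample_size))
    splitters = [sample[i * oversample] for i in range(1, num_buckets)]

    # Step 2: partition. Each record goes to the one bucket whose key range
    # contains it, so the buckets are non-overlapping.
    buckets = [[] for _ in range(num_buckets)]
    for rec in records:
        buckets[bisect.bisect_right(splitters, rec)].append(rec)

    # Step 3: sort each bucket independently and concatenate the results.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result
```

Because every record in bucket i compares no greater than every record in bucket i+1, concatenating the sorted buckets yields a fully sorted output without a merge step. In the system the abstract describes, the partitioning and bucket sorting are done by FPGA hardware rather than by loops like these.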

Citations: 7

Improved Techniques for Sensing Intra-Device Side Channel Leakage

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2019-04-01 DOI: 10.1109/FCCM.2019.00069

William Hunter, Christopher McCarty, L. Lerner

Citations: 0

OpenCL Kernel Vectorization on the CPU, GPU, and FPGA: A Case Study with Frequent Pattern Compression

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2019-04-01 DOI: 10.1109/FCCM.2019.00071

Zheming Jin, H. Finkel

Citations: 0

Design Patterns for Code Reuse in HLS Packet Processing Pipelines

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2019-04-01 DOI: 10.1109/FCCM.2019.00036

Haggai Eran, Lior Zeno, Z. István, M. Silberstein

Citations: 9

[Publisher's information]

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2019-04-01 DOI: 10.1109/fccm.2019.00081

Citations: 0

Active Stereo Vision with High Resolution on an FPGA

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2019-04-01 DOI: 10.1109/FCCM.2019.00026

Marc Pfeifer, P. Scholl, R. Voigt, B. Becker

Abstract: We present a novel FPGA-based active stereo vision system, tailored for use in a mobile 3D stereo camera. For the generation of a single 3D map the matching algorithm is based on a correlation approach, where multiple stereo image pairs instead of a single one are processed to guarantee an improved depth resolution. To efficiently handle the large amounts of incoming image data we adapt the algorithm to the underlying FPGA structures, e.g. by making use of pipelining and parallelization. Experiments demonstrate that our approach provides high-quality 3D maps at least three times more energy-efficient (5.5 fps/W) than comparable approaches executed on CPU and GPU platforms. Implemented on a Xilinx Zynq-7030 SoC our system provides a computation speed of 12.2 fps, at a resolution of 1.3 megapixels and a 128 pixel disparity search space. As such it outperforms the currently best passive stereo systems of the Middlebury Stereo Evaluation in terms of speed and accuracy. The presented approach is therefore well suited for mobile applications that require a highly accurate and energy-efficient active stereo vision system.

Citations: 1

Hardware Acceleration of Long Read Pairwise Overlapping in Genome Sequencing: A Race Between FPGA and GPU

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2019-04-01 DOI: 10.1109/FCCM.2019.00027

Licheng Guo, Jason Lau, Zhenyuan Ruan, Peng Wei, J. Cong

Abstract: In genome sequencing, it is a crucial but time-consuming task to detect potential overlaps between any pair of the input reads, especially those that are ultra-long. The state-of-the-art overlapping tool Minimap2 outperforms other popular tools in speed and accuracy. It has a single computing hot-spot, chaining, that takes 70% of the time and needs to be accelerated. There are several crucial issues for hardware acceleration because of the nature of chaining. First, the original computation pattern is poorly parallelizable and a direct implementation will result in low utilization of parallel processing units. We propose a method to reorder the operation sequence that transforms the algorithm into a hardware-friendly form. Second, the large but variable sizes of input data make it hard to leverage task-level parallelism. Therefore, we customize a fine-grained task dispatching scheme which could keep parallel PEs busy while satisfying the on-chip memory restriction. Based on these optimizations, we map the algorithm to a fully pipelined streaming architecture on FPGA using HLS, which achieves significant performance improvement. The principles of our acceleration design apply to both FPGA and GPU. Compared to the multithreaded CPU baseline, our GPU accelerator achieves 7x acceleration, while our FPGA accelerator achieves 28x acceleration. We further conduct an architecture study to quantitatively analyze the architectural reason for the performance difference. The summarized insights could serve as a guide on choosing the proper hardware acceleration platform.

Citations: 55

Yosys+nextpnr: An Open Source Framework from Verilog to Bitstream for Commercial FPGAs

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2019-03-25 DOI: 10.1109/FCCM.2019.00010

David Shah, Eddie Hung, C. Wolf, Serge Bazanski, D. Gisselquist, Miodrag Milanovic

Citations: 61

Cost-Effective Energy Monitoring of a Zynq-Based Real-Time System Including Dual Gigabit Ethernet

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2019-03-22 DOI: 10.1109/FCCM.2019.00068

M. Geier, Dominik Faller, Marian Brändle, S. Chakraborty

Abstract: Recent FPGA architectures integrate various power management features already established in CPU-driven SoCs to reach more energy-sensitive application domains such as automotive and robotics. This also qualifies hybrid Programmable SoCs (pSoCs) that combine fixed-function SoCs with configurable FPGA fabric for heterogeneous Real-time Systems (RTSs), which operate under predefined latency and power constraints in safety-critical environments. Their complex application-specific computation and communication (incl. I/O) architectures result in highly varying power consumption, which requires precise voltage and current sensing on all relevant supply rails to enable dependable evaluation of available and novel power management techniques. In this paper, we propose a low-cost 18-channel 16-bit-resolution measurement system capable of over 200 kSPS (kilo-samples per second) for instrumentation of current pSoC development boards. In addition, we propose to include crucial I/O components such as Ethernet PHYs into the power monitoring to gain a holistic view of the RTS's temporal behavior covering not only computation on FPGA and CPUs, but also communication in terms of, e.g., reception of sensor values and transmission of actuation signals. We present an FMC-sized implementation of our measurement system combined with two Gigabit Ethernet PHYs and one HDMI input. Paired with Xilinx's ZC702 development board, we are able to synchronously acquire power traces of a Zynq pSoC and the two PHYs precise enough to identify individual Ethernet frames.

Citations: 4

Module-per-Object: A Human-Driven Methodology for C++-Based High-Level Synthesis Design

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2019-03-05 DOI: 10.1109/FCCM.2019.00037

Jeferson Santiago da Silva, F. Boyer, J. Langlois

Abstract: High-Level Synthesis (HLS) brings FPGAs to audiences previously unfamiliar with hardware design. However, achieving the highest Quality-of-Results (QoR) with HLS is still unattainable for most programmers. This requires detailed knowledge of FPGA architecture and hardware design in order to produce FPGA-friendly code. Moreover, such code is normally in conflict with best coding practices, which favor code reuse, modularity, and conciseness. To overcome these limitations, we propose Module-per-Object (MpO), a human-driven HLS design methodology intended for both hardware designers and software developers with limited FPGA expertise. MpO exploits modern C++ to raise the abstraction level while improving QoR, code readability and modularity. To guide HLS designers, we present the five characteristics of MpO classes. Each characteristic exploits the power of HLS-supported modern C++ features to build C++-based hardware modules. These characteristics lead to high-quality software descriptions and efficient hardware generation. We also present a use case of MpO, where we use C++ as the intermediate language for FPGA-targeted code generation from P4, a packet processing domain specific language. The MpO methodology is evaluated using three design experiments: a packet parser, a flow-based traffic manager, and a digital up-converter. Based on experiments, we show that MpO can be comparable to handwritten VHDL code while keeping a high abstraction level, human-readable coding style and modularity. Compared to traditional C-based HLS design, MpO leads to more efficient circuit generation, both in terms of performance and resource utilization. Also, the MpO approach notably improves software quality, augmenting parameterization while eliminating the incidence of code duplication.

Citations: 8