2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines: Latest Papers, Page 2

An Approach to a Fully Automated Partial Reconfiguration Design Flow

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines Pub Date : 2013-04-28 DOI: 10.1109/FCCM.2013.33

Kizheppatt Vipin, Suhaib A. Fahmy

Citations: 2

An FPGA-Based Data Flow Engine for Gaussian Copula Model

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines Pub Date : 2013-04-28 DOI: 10.1109/FCCM.2013.14

Huabin Ruan, Xiaomeng Huang, H. Fu, Guangwen Yang, W. Luk, S. Racanière, O. Pell, Wenji Han

Abstract: The Gaussian Copula Model (GCM) plays an important role in state-of-the-art financial analysis for modeling the dependence of financial assets. However, existing implementations of GCM are all computationally demanding and time-consuming. In this paper, we propose a Dataflow Engine (DFE) design to accelerate the GCM computation. Specifically, a commonly used CPU-friendly GCM algorithm is converted into a fully pipelined dataflow graph through four optimization steps: recomposing the algorithm to be pipeline-friendly, removing unnecessary computation, sharing common computing results, and reducing the computing precision while maintaining the same level of accuracy in the results. The performance of the proposed DFE design is compared with three well-optimized CPU-based implementations. Experimental results show that our DFE solution not only generates fairly accurate results, but also achieves a maximum speedup of 467x over a single-thread CPU-based solution, 120x over a multi-thread CPU-based solution, and 47x over an MPI-based solution.

Citations: 2

High-Level Description and Synthesis of Floating-Point Accumulators on FPGA

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines Pub Date : 2013-04-28 DOI: 10.1109/FCCM.2013.37

Marc-André Daigneault, J. David

Abstract: Decades of research in the field of high-level hardware description have now resulted in tools that are able to automatically transform C/C++ constructs into highly optimized parallel and pipelined architectures. Such approaches work well when the control flow is known a priori, since the computation results in a large dataflow graph that can be mapped onto the available operators. Nevertheless, some applications have a control flow that is highly dependent on the data. This paper focuses on the hardware implementation of such applications and presents a high-level synthesis methodology applied to a Hardware Description Language (HDL) in which assignments correspond to self-synchronized connections between predefined data streaming sources and sinks. A data transfer occurs over an established connection when both source and sink are ready, according to their synchronization interfaces. Founded on a high-level communicating FSM programming model, the language allows the user to describe and dynamically modify streaming architectures exploiting spatial and temporal parallelism. Our compiler attempts to maximize the number of transfers at each clock cycle and automatically fixes the potential combinatorial loops induced by the dynamic connection of dependent sources and sinks. The methodology is applied to the synthesis of a pipelined floating-point accumulator using the Delayed-Buffering (DB) reduction method. The results we obtain are similar to state-of-the-art dedicated architectures but require much less design time and expertise.

Citations: 3

Parallel Computation of Skyline Queries

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines Pub Date : 2013-04-28 DOI: 10.1109/FCCM.2013.18

L. Woods, G. Alonso, J. Teubner

Citations: 45

Reconfigurable Acceleration of Short Read Mapping

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines Pub Date : 2013-04-28 DOI: 10.1109/FCCM.2013.57

James Arram, K. H. Tsoi, W. Luk, P. Jiang

Citations: 40

An FPGA Based PCI-E Root Complex Architecture for Standalone SOPCs

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines Pub Date : 2013-04-28 DOI: 10.1109/FCCM.2013.29

Yingjie Cao, Yongxin Zhu, Xu Wang, Jiang Jiang, Meikang Qiu

Citations: 3

Automating Elimination of Idle Functions by Run-Time Reconfiguration

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines Pub Date : 2013-04-28 DOI: 10.1145/2700415

Xinyu Niu, T. Chau, Qiwei Jin, W. Luk, Qiang Liu, O. Pell

Citations: 19

A Multithreaded VLIW Soft Processor Family

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines Pub Date : 2013-04-28 DOI: 10.1109/FCCM.2013.36

Kalin Ovtcharov, Ilian Tili, J. Steffan

Abstract: Summary form only given. There is growing commercial interest in using FPGAs for compute acceleration. To ease the programming task for non-hardware-expert programmers, systems are emerging that can map high-level languages such as C and OpenCL to FPGAs, targeting compiler-generated circuits and soft processing engines. Soft processing engines such as CPUs are familiar to programmers, can be reprogrammed quickly without rebuilding the FPGA image, and by their general nature can support multiple software functions in a smaller area than the alternative of multiple per-function synthesized circuits. Finally, compelling processing engines can be incorporated into the output of high-level synthesis systems. For FPGA-based soft compute engines to be compelling they must be computationally dense: they must achieve high throughput per area. For simple CPUs with simple functional units (FUs) it is relatively straightforward to achieve good utilization, and it is not overly detrimental if a small, single-pipeline-stage FU such as an integer adder is under-utilized. In contrast, larger, more deeply pipelined, more numerous, and more varied FUs can be quite challenging to keep busy, even for an engine capable of extracting instruction-level parallelism (ILP) from an application. Hence a key challenge for FPGA-based compute engines is how to maximize compute density (throughput per area) by achieving high utilization of a datapath composed of multiple varying FUs of significant and varying pipeline depth. In this work, we propose a highly parameterizable template architecture of a multi-threaded FPGA-based compute engine designed to highly utilize varied and deeply pipelined FUs. Our approach to achieving high utilization is to leverage (i) support for multiple thread contexts, (ii) thread-level and instruction-level parallelism, and (iii) static compiler analysis and scheduling. We focus on deeply pipelined, IEEE-754 floating-point FUs of widely varying latency, executing both Hodgkin-Huxley neuron simulation and Black-Scholes options pricing models as example applications, compiled with our LLVM-based scheduler. Targeting a Stratix IV FPGA, we explore architectural tradeoffs by measuring area and throughput for designs with varying numbers of FUs, thread contexts (T), memory banks (B), and bank multi-porting. To determine the most efficient designs that would be suitable for replicating, we measure compute density (application throughput per unit of FPGA area), and report which architectural choices lead to the most computationally dense designs. The most computationally dense design is not necessarily the one with the highest throughput, and (i) for maximizing throughput, having each thread reside in its own bank is best; (ii) when only moderate numbers of independent threads are available, the compute engine has higher compute density than a custom hardware implementation, e.g., 2.3x for 32 threads; (iii) the best FU mix does not necessarily match the FU usage in th

Citations: 1

Image Segmentation Using Hardware Forest Classifiers

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines Pub Date : 2013-04-28 DOI: 10.1109/FCCM.2013.20

Richard Neil Pittman, A. Forin, A. Criminisi, J. Shotton, A. Mahram

Abstract: Image segmentation is the process of partitioning an image into segments or subsets of pixels for purposes of further analysis, such as separating the interesting objects in the foreground from the uninteresting objects in the background. In many image processing applications, the process requires a sequence of computational steps on a per-pixel basis, thereby binding the performance to the size and resolution of the image. As applications require greater resolution and larger images, the computational demands of this step can quickly exceed the resources of available CPUs, especially in the power- and thermal-constrained areas of consumer electronics and mobile. In this work, we use a hardware tree-based classifier to solve the image segmentation problem. The application is background removal (BGR) from depth maps obtained from the Microsoft Kinect sensor. After the image is segmented, subsequent steps then classify the objects in the scene. The approach is flexible: to address different application domains we only need to change the trees used by the classifiers. We describe two distinct approaches and evaluate their performance using the commercial-grade testing environment used for the Microsoft Xbox gaming console.

Citations: 5

Open-Source Bitstream Generation

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines Pub Date : 2013-04-28 DOI: 10.1109/FCCM.2013.45

Ritesh Soni, Neil Steiner, M. French

Citations: 21