2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines: Latest Publications

FMSA: FPGA-Accelerated ClustalW-Based Multiple Sequence Alignment through Pipelined Prefiltering
A. Mahram, M. Herbordt
{"title":"FMSA: FPGA-Accelerated ClustalW-Based Multiple Sequence Alignment through Pipelined Prefiltering","authors":"A. Mahram, M. Herbordt","doi":"10.1109/FCCM.2012.38","DOIUrl":"https://doi.org/10.1109/FCCM.2012.38","url":null,"abstract":"Multiple Sequence Alignment (MSA) is perhaps second only to sequence alignment in overall importance in Bioinformatics, being critical, e.g., in determining the structure and function of molecules from putative families of sequences. But while pair wise sequence alignment has been the subject of scores of FPGA acceleration studies, MSA only a few. The most important of these accelerate Clustal-W, the most commonly used MSA code, by either implementing the first of three phases (over 90% of the run time) with Dynamic Programming (DP) methods, or by accelerating the third phase which consumes most of the remaining time. We use a new approach: we apply prefiltering of the kind commonly used in BLAST to perform the initial all-pairs alignments. This results in a speedup of from 80× to 190× over the CPU code (8 cores) and speedup of from 2.5× to 8× over DP/FPGA- and GPU-based methods. When combined with a recently published method for phase 3, and using the original software for phase 2, the end-to-end speedup is at least 50× over an 8-core implementation of the original code. The quality is comparable to the original according to a commonly used benchmark suite evaluated with respect to multiple distance metrics.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129481713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
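For readers unfamiliar with the idea, the sketch below illustrates BLAST-style k-mer prefiltering in plain software: sequence pairs that share too few short exact substrings are screened out before any expensive dynamic-programming alignment. It is a minimal sketch of the general technique only; the k-mer length, the `min_shared` threshold, and the helper names are illustrative assumptions and are not taken from the paper's pipelined hardware.

```python
from itertools import combinations

def kmer_set(seq, k=4):
    """Collect all overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def prefilter_pairs(seqs, k=4, min_shared=3):
    """Keep only sequence pairs that share at least `min_shared` k-mers.

    Pairs that pass the filter would then be handed to the expensive
    dynamic-programming aligner; the rest can be scored cheaply.
    """
    kmers = [kmer_set(s, k) for s in seqs]
    candidates = []
    for i, j in combinations(range(len(seqs)), 2):
        shared = len(kmers[i] & kmers[j])
        if shared >= min_shared:
            candidates.append((i, j, shared))
    return candidates

if __name__ == "__main__":
    seqs = ["MKTAYIAKQR", "MKTAYLAKQR", "GGGGGGGGGG"]
    print(prefilter_pairs(seqs))  # only the first two sequences survive the filter
```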
A Custom Precision Based Architecture for Accelerating Parallel Tempering MCMC on FPGAs without Introducing Sampling Error
Grigorios Mingas, C. Bouganis
{"title":"A Custom Precision Based Architecture for Accelerating Parallel Tempering MCMC on FPGAs without Introducing Sampling Error","authors":"Grigorios Mingas, C. Bouganis","doi":"10.1109/FCCM.2012.34","DOIUrl":"https://doi.org/10.1109/FCCM.2012.34","url":null,"abstract":"Markov Chain Monte Carlo (MCMC) is a method used to draw samples from probability distributions in order to estimate - otherwise intractable - integrals. When the distribution is complex, simple MCMC becomes inefficient and advanced, computationally intensive MCMC methods are employed to make sampling possible. This work proposes a novel streaming FPGA architecture to accelerate Parallel Tempering, a widely adopted MCMC method designed to sample from multimodal distributions. The proposed architecture demonstrates how custom precision can be intelligently employed without introducing sampling errors, in order to save resources and increase the sampling throughg put. Speedups of up to two orders of magnitude compared to software and 1.53x-76.88x compared to a GPGPU implementation are achieved when performing Bayesian inference for a mixture model.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126090806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
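As background for the architecture above, the following host-side sketch shows the parallel tempering algorithm itself: several Metropolis chains run at different temperatures and occasionally swap states so the cold chain can escape local modes. The toy bimodal target, temperature ladder, and step size are illustrative assumptions; the sketch uses ordinary double precision and says nothing about the custom fixed-point formats the paper employs.

```python
import math
import random

def log_target(x):
    # Toy bimodal target: equal mixture of Gaussians centred at -3 and +3.
    s = math.exp(-0.5 * (x - 3.0) ** 2) + math.exp(-0.5 * (x + 3.0) ** 2)
    return math.log(s) if s > 0.0 else -1e300

def accept(log_ratio):
    # Metropolis acceptance test.
    return log_ratio >= 0.0 or random.random() < math.exp(log_ratio)

def parallel_tempering(n_iters=5000, temps=(1.0, 2.0, 4.0, 8.0), step=1.0):
    chains = [0.0] * len(temps)
    cold_samples = []                      # keep samples from the T = 1 chain only
    for _ in range(n_iters):
        # Local Metropolis move inside every tempered chain.
        for c, temp in enumerate(temps):
            prop = chains[c] + random.gauss(0.0, step)
            if accept((log_target(prop) - log_target(chains[c])) / temp):
                chains[c] = prop
        # Propose swapping the states of a random pair of neighbouring temperatures.
        c = random.randrange(len(temps) - 1)
        log_ratio = (1.0 / temps[c] - 1.0 / temps[c + 1]) * (
            log_target(chains[c + 1]) - log_target(chains[c]))
        if accept(log_ratio):
            chains[c], chains[c + 1] = chains[c + 1], chains[c]
        cold_samples.append(chains[0])
    return cold_samples

if __name__ == "__main__":
    samples = parallel_tempering()
    print("mean of cold chain:", sum(samples) / len(samples))
```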
Formic: Cost-efficient and Scalable Prototyping of Manycore Architectures
Spyros Lyberis, G. Kalokerinos, Michalis Lygerakis, Vassilis D. Papaefstathiou, Dimitrios Tsaliagkos, M. Katevenis, D. Pnevmatikatos, Dimitrios S. Nikolopoulos
{"title":"Formic: Cost-efficient and Scalable Prototyping of Manycore Architectures","authors":"Spyros Lyberis, G. Kalokerinos, Michalis Lygerakis, Vassilis D. Papaefstathiou, Dimitrios Tsaliagkos, M. Katevenis, D. Pnevmatikatos, Dimitrios S. Nikolopoulos","doi":"10.1109/FCCM.2012.20","DOIUrl":"https://doi.org/10.1109/FCCM.2012.20","url":null,"abstract":"Modeling emerging multicore architectures is challenging and imposes a tradeoff between simulation speed and accuracy. An effective practice that balances both targets well is to map the target architecture on FPGA platforms. We find that accurate prototyping of hundreds of cores on existing FPGA boards faces at least one of the following problems: (i) limited fast memory resources (SRAM) to model caches, (ii) insufficient inter-board connectivity for scaling the design or (iii) the board is too expensive. We address these shortcomings by designing a new FPGA board for multicore architecture prototyping, which explicitly targets scalability and cost-efficiency. Formic has a 35% bigger FPGA, three times more SRAM, four times more links and costs at most half as much when compared to the popular Xilinx XUPV5 prototyping platform. We build and test a 64-board system by developing a 512-core, Micro Blaze-based, non-coherent hardware prototype with DMA capabilities, with full network on-chip in a 3D-mesh topology. We believe that Formic offers significant advantages over existing academic and commercial platforms that can facilitate hardware prototyping for future many core architectures.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"627 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132726525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 25
Towards a Universal FPGA Matrix-Vector Multiplication Architecture
S. Kestur, John D. Davis, Eric S. Chung
{"title":"Towards a Universal FPGA Matrix-Vector Multiplication Architecture","authors":"S. Kestur, John D. Davis, Eric S. Chung","doi":"10.1109/FCCM.2012.12","DOIUrl":"https://doi.org/10.1109/FCCM.2012.12","url":null,"abstract":"We present the design and implementation of a universal, single-bit stream library for accelerating matrix-vector multiplication using FPGAs. Our library handles multiple matrix encodings ranging from dense to multiple sparse formats. A key novelty in our approach is the introduction of a hardware-optimized sparse matrix representation called Compressed Variable-Length Bit Vector (CVBV), which reduces the storage and bandwidth requirements up to 43% (on average 25%) compared to compressed sparse row (CSR) across all the matrices from the University of Florida Sparse Matrix Collection. Our hardware incorporates a runtime-programmable decoder that performs on-the-fly-decoding of various formats such as Dense, COO, CSR, DIA, and ELL. The flexibility and scalability of our design is demonstrated across two FPGA platforms: (1) the BEE3 (Virtex-5 LX155T with 16GB of DRAM) and (2) ML605 (Virtex-6 LX240T with 2GB of DRAM). For dense matrices, our approach scales to large data sets with over 1 billion elements, and achieves robust performance independent of the matrix aspect ratio. For sparse matrices, our approach using a compressed representation reduces the overall bandwidth while also achieving comparable efficiency relative to state-of-the-art approaches.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122349767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 77
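For reference, the sketch below shows a plain CSR sparse matrix-vector product, the baseline format the paper compares against. The CVBV encoding introduced in the paper further compresses the column-index stream; that encoding is not reproduced here, and the small example matrix is purely illustrative.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix A stored in compressed sparse row (CSR) form.

    values  : nonzero entries, stored row by row
    col_idx : column index of each nonzero
    row_ptr : row_ptr[r]..row_ptr[r+1] delimits row r's nonzeros
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for r in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y[r] = acc
    return y

if __name__ == "__main__":
    # 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    col_idx = [0, 2, 1, 0, 2]
    row_ptr = [0, 2, 3, 5]
    print(csr_spmv(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```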
A Low-Overhead Profiling and Visualization Framework for Hybrid Transactional Memory
Oriol Arcas, Philipp Kirchhofer, Nehir Sönmez, M. Schindewolf, O. Unsal, Wolfgang Karl, A. Cristal
{"title":"A Low-Overhead Profiling and Visualization Framework for Hybrid Transactional Memory","authors":"Oriol Arcas, Philipp Kirchhofer, Nehir Sönmez, M. Schindewolf, O. Unsal, Wolfgang Karl, A. Cristal","doi":"10.1109/FCCM.2012.11","DOIUrl":"https://doi.org/10.1109/FCCM.2012.11","url":null,"abstract":"Multi-core prototyping presents a good opportunity for establishing low overhead and detailed profiling and visualization in order to study new research topics. In this paper, we design and implement a low execution, low area overhead profiling mechanism and a visualization tool for observing Transactional Memory behaviors on FPGA. To achieve this, we non-disruptively create and bring out events on the fly and process them offline on a host. There, our tool regenerates the execution from the collected events and produces traces for comprehensively inspecting the behavior of interacting multithreaded programs. With zero execution overhead for hardware TM events, single-instruction overhead for software TM events, and utilizing a low logic area of 2.3% per processor core, we run TM benchmarks to evaluate various different levels of profiling detail with an average runtime overhead of 6%. We demonstrate the usefulness of such detailed examination of SW/HW transactional behavior in two parts: (i) we speed up a TM benchmark by 24.1%, and (ii) we closely inspect transactions to point out pathologies.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121931241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
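A toy illustration of the offline processing step described above: turning a stream of timestamped transactional events into per-transaction durations and per-core abort counts. The `(timestamp, core, tx_id, kind)` event layout is a hypothetical stand-in, not the event format the hardware actually emits.

```python
from collections import defaultdict

def summarize_tm_events(events):
    """events: iterable of (timestamp, core, tx_id, kind) tuples with kind in
    {'begin', 'commit', 'abort'}. Returns per-transaction commit durations
    and the number of aborts observed per core."""
    start = {}
    durations = {}
    aborts = defaultdict(int)
    for ts, core, tx, kind in sorted(events):
        if kind == "begin":
            start[(core, tx)] = ts
        elif kind == "commit":
            durations[(core, tx)] = ts - start.pop((core, tx))
        elif kind == "abort":
            aborts[core] += 1
            start.pop((core, tx), None)   # discard the aborted attempt's start time
    return durations, dict(aborts)

if __name__ == "__main__":
    ev = [(0, 0, 1, "begin"), (5, 0, 1, "abort"),
          (6, 0, 2, "begin"), (9, 0, 2, "commit")]
    print(summarize_tm_events(ev))
```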
Power Management Strategies for Serial RapidIO Endpoints in FPGAs
Moritz Schmid, Frank Hannig, J. Teich
{"title":"Power Management Strategies for Serial RapidIO Endpoints in FPGAs","authors":"Moritz Schmid, Frank Hannig, J. Teich","doi":"10.1109/FCCM.2012.26","DOIUrl":"https://doi.org/10.1109/FCCM.2012.26","url":null,"abstract":"We propose a novel data budget-based approach to dynamically control the average power consumption of Serial RapidIO endpoint controllers in FPGAs. The key concept of the approach is to not only perform clock-gating on the FPGA-internal components of the communication controller, but to disable the multi-gigabit transceivers during idle periods. The clock synchronization, inherent to serial interfaces, enables us to omit the often needed periodic link sensing, and only enable the controller according to a predefined schedule to transmit the allocated amount of data during a specific interval. Following this approach, we are able to reduce the dynamic power consumption by up to 77% on average.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"2011 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121543935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
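The data-budget idea lends itself to a simple back-of-the-envelope schedule: given an allocated number of bytes per interval and the link's line rate, compute how long the transceivers must stay enabled and gate them for the rest of the interval. The sketch below shows that calculation; the link rate, interval, and wake-up overhead are illustrative numbers, not figures from the paper.

```python
def active_window(bytes_per_interval, interval_s, link_bytes_per_s, wakeup_overhead_s=0.0):
    """Return (on_time_s, duty_cycle) needed to move the allocated data budget
    within one scheduling interval, including a fixed wake-up overhead."""
    on_time = bytes_per_interval / link_bytes_per_s + wakeup_overhead_s
    if on_time > interval_s:
        raise ValueError("data budget exceeds link capacity for this interval")
    return on_time, on_time / interval_s

if __name__ == "__main__":
    # e.g. 1 MiB every 10 ms over a ~1 GB/s serial link (illustrative numbers)
    on, duty = active_window(1 << 20, 0.010, 1_000_000_000, wakeup_overhead_s=100e-6)
    print(f"on-time = {on * 1e3:.2f} ms, duty cycle = {duty:.1%}")
```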
Memory Bandwidth Efficient Two-Dimensional Fast Fourier Transform Algorithm and Implementation for Large Problem Sizes
Berkin Akin, Peter Milder, F. Franchetti, J. Hoe
{"title":"Memory Bandwidth Efficient Two-Dimensional Fast Fourier Transform Algorithm and Implementation for Large Problem Sizes","authors":"Berkin Akin, Peter Milder, F. Franchetti, J. Hoe","doi":"10.1109/FCCM.2012.40","DOIUrl":"https://doi.org/10.1109/FCCM.2012.40","url":null,"abstract":"Prevailing VLSI trends point to a growing gap between the scaling of on-chip processing throughput and off-chip memory bandwidth. An efficient use of memory bandwidth must become a first-class design consideration in order to fully utilize the processing capability of highly concurrent processing platforms like FPGAs. In this paper, we present key aspects of this challenge in developing FPGA-based implementations of two-dimensional fast Fourier transform (2D-FFT) where the large datasets must reside off-chip in DRAM. Our scalable implementations address the memory bandwidth bottleneck through both (1) algorithm design to enable efficient DRAM access patterns and (2) data path design to extract the maximum compute throughput for a given level of memory bandwidth. We present results for double-precision 2D-FFT up to size 2,048-by-2,048. On an Alter a DE4 platform our implementation of the 2,048-by-2,048 2D-FFT can achieve over 19.2 Gflop/s from the 12 GByte/s maximum DRAM bandwidth available. The results also show that our FPGA-based implementations of 2D-FFT are more efficient than 2D-FFT running on state-of-the-art CPUs and GPUs in terms of the bandwidth and power efficiency.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116944174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 25
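The row-column decomposition underlying most 2D-FFT implementations is easy to state in software: apply 1D FFTs along the rows, then along the columns. The sketch below demonstrates only this algorithmic decomposition and checks it against a direct 2D FFT; it does not reflect the tiled DRAM access patterns or the datapath design the paper develops.

```python
import numpy as np

def fft2_row_column(a):
    """2D FFT via the row-column method: 1D FFTs over rows, then over columns."""
    step1 = np.fft.fft(a, axis=1)      # transform each row
    return np.fft.fft(step1, axis=0)   # then transform each column

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
    assert np.allclose(fft2_row_column(a), np.fft.fft2(a))
    print("row-column 2D FFT matches np.fft.fft2")
```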
RIFFA: A Reusable Integration Framework for FPGA Accelerators
Matthew Jacobsen, Y. Freund, R. Kastner
{"title":"RIFFA: A Reusable Integration Framework for FPGA Accelerators","authors":"Matthew Jacobsen, Y. Freund, R. Kastner","doi":"10.1109/FCCM.2012.44","DOIUrl":"https://doi.org/10.1109/FCCM.2012.44","url":null,"abstract":"We present RIFFA, a reusable integration framework for FPGA accelerators. RIFFA provides communication and synchronization for FPGA accelerated software using a standard interface. Our goal is to expand the use of FPGAs as an acceleration platform by releasing, as open source, a no cost framework that easily integrates software on traditional CPUs with FPGA based IP cores, over PCIe, with minimal custom configuration. RIFFA requires no specialized hardware or fee licensed IP cores. It can be deployed on common Linux workstations with a PCIe bus and has been tested on two different Linux distributions using Xilinx FPGAs.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125247001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 65
Area-Efficient Architectures for Large Integer and Quadruple Precision Floating Point Multipliers
M. Jaiswal, R. Cheung
{"title":"Area-Efficient Architectures for Large Integer and Quadruple Precision Floating Point Multipliers","authors":"M. Jaiswal, R. Cheung","doi":"10.1109/FCCM.2012.14","DOIUrl":"https://doi.org/10.1109/FCCM.2012.14","url":null,"abstract":"Large integer multiplication and floating point multiplication are the two dominating operations for many scientific and cryptographic applications. Large integer multipliers generally have linearly but high area requirement according to a given bit-width. High precision requirements of a given application lead to the use of quadruple precision arithmetic, however its operation is dominated by large integer multiplication of the mantissa product. In this paper, we propose a hardware efficient approach for implementing a fully pipelined large integer multipliers, and further extending it to Quadruple Precision (QP) floating point multiplication. The proposed design uses less hardware resources in terms of DSP48 blocks and slices, while attaining high performance. Promising results are obtained when compared our designs with the best reported large integer multipliers and also QP floating point multiplier in literatures. For instance, our results have demonstrated a significant improvement for the proposed QP multiplier, for over 50% improvement in terms of the DSP48 block usage with a penalty of slight additional slices, when compared to the best result in the literature on a Virtex-4 device.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117350746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
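The basic idea behind area-efficient wide multipliers, building a large product out of many narrow ones, can be shown with a schoolbook limb decomposition in which each limb-by-limb product corresponds to the kind of operation a DSP48 block provides. The limb width and the 113-bit operand size (the precision of a quadruple-precision significand) below are illustrative assumptions, and the sketch does not reproduce the specific decomposition or pipelining of the paper's design.

```python
def wide_mul(a, b, limb_bits=17):
    """Multiply two arbitrary-size non-negative integers by splitting them into
    limbs and accumulating shifted limb-by-limb partial products."""
    mask = (1 << limb_bits) - 1

    def limbs(x):
        out = []
        while x:
            out.append(x & mask)
            x >>= limb_bits
        return out or [0]

    la, lb = limbs(a), limbs(b)
    result = 0
    for i, ai in enumerate(la):
        for j, bj in enumerate(lb):
            # One narrow ("DSP-sized") product, shifted into place and accumulated.
            result += (ai * bj) << (limb_bits * (i + j))
    return result

if __name__ == "__main__":
    import random
    x, y = random.getrandbits(113), random.getrandbits(113)  # QP significand width
    assert wide_mul(x, y) == x * y
    print("wide_mul matches Python's built-in big-integer multiply")
```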
Exploiting Modified Placement and Hardwired Resources to Provide High Reliability in FPGAs
G. Nazar, L. Carro
{"title":"Exploiting Modified Placement and Hardwired Resources to Provide High Reliability in FPGAs","authors":"G. Nazar, L. Carro","doi":"10.1109/FCCM.2012.56","DOIUrl":"https://doi.org/10.1109/FCCM.2012.56","url":null,"abstract":"Possible scenarios for future manufacturing technologies increase the desirable features of fault tolerance techniques, such as coping with multiple faults and reducing error latency. On the other hand, current high-end FPGAs present, besides lookup tables and flip-flops, several dedicated components that perform the most commonly required functions. In this paper, we propose an approach to use such resources to efficiently provide fault detection capabilities. We further extend the technique with placement constraints to enhance the detection of faults affecting the routing resources, which is a critical demand for such devices.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116119567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13