2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines: Latest Publications

FMSA: FPGA-Accelerated ClustalW-Based Multiple Sequence Alignment through Pipelined Prefiltering
A. Mahram, M. Herbordt
{"title":"FMSA: FPGA-Accelerated ClustalW-Based Multiple Sequence Alignment through Pipelined Prefiltering","authors":"A. Mahram, M. Herbordt","doi":"10.1109/FCCM.2012.38","DOIUrl":"https://doi.org/10.1109/FCCM.2012.38","url":null,"abstract":"Multiple Sequence Alignment (MSA) is perhaps second only to sequence alignment in overall importance in Bioinformatics, being critical, e.g., in determining the structure and function of molecules from putative families of sequences. But while pair wise sequence alignment has been the subject of scores of FPGA acceleration studies, MSA only a few. The most important of these accelerate Clustal-W, the most commonly used MSA code, by either implementing the first of three phases (over 90% of the run time) with Dynamic Programming (DP) methods, or by accelerating the third phase which consumes most of the remaining time. We use a new approach: we apply prefiltering of the kind commonly used in BLAST to perform the initial all-pairs alignments. This results in a speedup of from 80× to 190× over the CPU code (8 cores) and speedup of from 2.5× to 8× over DP/FPGA- and GPU-based methods. When combined with a recently published method for phase 3, and using the original software for phase 2, the end-to-end speedup is at least 50× over an 8-core implementation of the original code. The quality is comparable to the original according to a commonly used benchmark suite evaluated with respect to multiple distance metrics.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129481713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
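For readers unfamiliar with the idea, the sketch below illustrates BLAST-style k-mer prefiltering in plain software: sequence pairs that share too few short exact substrings are screened out before any expensive dynamic-programming alignment. It is a minimal sketch of the general technique only; the k-mer length, the `min_shared` threshold, and the helper names are illustrative assumptions and are not taken from the paper's pipelined hardware.

```python
from itertools import combinations

def kmer_set(seq, k=4):
    """Collect all overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def prefilter_pairs(seqs, k=4, min_shared=3):
    """Keep only sequence pairs that share at least `min_shared` k-mers.

    Pairs that pass the filter would then be handed to the expensive
    dynamic-programming aligner; the rest can be scored cheaply.
    """
    kmers = [kmer_set(s, k) for s in seqs]
    candidates = []
    for i, j in combinations(range(len(seqs)), 2):
        shared = len(kmers[i] & kmers[j])
        if shared >= min_shared:
            candidates.append((i, j, shared))
    return candidates

if __name__ == "__main__":
    seqs = ["MKTAYIAKQR", "MKTAYLAKQR", "GGGGGGGGGG"]
    print(prefilter_pairs(seqs))  # only the first two sequences survive the filter
```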
A Custom Precision Based Architecture for Accelerating Parallel Tempering MCMC on FPGAs without Introducing Sampling Error
Grigorios Mingas, C. Bouganis
{"title":"A Custom Precision Based Architecture for Accelerating Parallel Tempering MCMC on FPGAs without Introducing Sampling Error","authors":"Grigorios Mingas, C. Bouganis","doi":"10.1109/FCCM.2012.34","DOIUrl":"https://doi.org/10.1109/FCCM.2012.34","url":null,"abstract":"Markov Chain Monte Carlo (MCMC) is a method used to draw samples from probability distributions in order to estimate - otherwise intractable - integrals. When the distribution is complex, simple MCMC becomes inefficient and advanced, computationally intensive MCMC methods are employed to make sampling possible. This work proposes a novel streaming FPGA architecture to accelerate Parallel Tempering, a widely adopted MCMC method designed to sample from multimodal distributions. The proposed architecture demonstrates how custom precision can be intelligently employed without introducing sampling errors, in order to save resources and increase the sampling throughg put. Speedups of up to two orders of magnitude compared to software and 1.53x-76.88x compared to a GPGPU implementation are achieved when performing Bayesian inference for a mixture model.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126090806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
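As background for the architecture above, the following host-side sketch shows the parallel tempering algorithm itself: several Metropolis chains run at different temperatures and occasionally swap states so the cold chain can escape local modes. The toy bimodal target, temperature ladder, and step size are illustrative assumptions; the sketch uses ordinary double precision and says nothing about the custom fixed-point formats the paper employs.

```python
import math
import random

def log_target(x):
    # Toy bimodal target: equal mixture of Gaussians centred at -3 and +3.
    s = math.exp(-0.5 * (x - 3.0) ** 2) + math.exp(-0.5 * (x + 3.0) ** 2)
    return math.log(s) if s > 0.0 else -1e300

def accept(log_ratio):
    # Metropolis acceptance test.
    return log_ratio >= 0.0 or random.random() < math.exp(log_ratio)

def parallel_tempering(n_iters=5000, temps=(1.0, 2.0, 4.0, 8.0), step=1.0):
    chains = [0.0] * len(temps)
    cold_samples = []                      # keep samples from the T = 1 chain only
    for _ in range(n_iters):
        # Local Metropolis move inside every tempered chain.
        for c, temp in enumerate(temps):
            prop = chains[c] + random.gauss(0.0, step)
            if accept((log_target(prop) - log_target(chains[c])) / temp):
                chains[c] = prop
        # Propose swapping the states of a random pair of neighbouring temperatures.
        c = random.randrange(len(temps) - 1)
        log_ratio = (1.0 / temps[c] - 1.0 / temps[c + 1]) * (
            log_target(chains[c + 1]) - log_target(chains[c]))
        if accept(log_ratio):
            chains[c], chains[c + 1] = chains[c + 1], chains[c]
        cold_samples.append(chains[0])
    return cold_samples

if __name__ == "__main__":
    samples = parallel_tempering()
    print("mean of cold chain:", sum(samples) / len(samples))
```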
Formic: Cost-efficient and Scalable Prototyping of Manycore Architectures
Spyros Lyberis, G. Kalokerinos, Michalis Lygerakis, Vassilis D. Papaefstathiou, Dimitrios Tsaliagkos, M. Katevenis, D. Pnevmatikatos, Dimitrios S. Nikolopoulos
{"title":"Formic: Cost-efficient and Scalable Prototyping of Manycore Architectures","authors":"Spyros Lyberis, G. Kalokerinos, Michalis Lygerakis, Vassilis D. Papaefstathiou, Dimitrios Tsaliagkos, M. Katevenis, D. Pnevmatikatos, Dimitrios S. Nikolopoulos","doi":"10.1109/FCCM.2012.20","DOIUrl":"https://doi.org/10.1109/FCCM.2012.20","url":null,"abstract":"Modeling emerging multicore architectures is challenging and imposes a tradeoff between simulation speed and accuracy. An effective practice that balances both targets well is to map the target architecture on FPGA platforms. We find that accurate prototyping of hundreds of cores on existing FPGA boards faces at least one of the following problems: (i) limited fast memory resources (SRAM) to model caches, (ii) insufficient inter-board connectivity for scaling the design or (iii) the board is too expensive. We address these shortcomings by designing a new FPGA board for multicore architecture prototyping, which explicitly targets scalability and cost-efficiency. Formic has a 35% bigger FPGA, three times more SRAM, four times more links and costs at most half as much when compared to the popular Xilinx XUPV5 prototyping platform. We build and test a 64-board system by developing a 512-core, Micro Blaze-based, non-coherent hardware prototype with DMA capabilities, with full network on-chip in a 3D-mesh topology. We believe that Formic offers significant advantages over existing academic and commercial platforms that can facilitate hardware prototyping for future many core architectures.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"627 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132726525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 25
Towards a Universal FPGA Matrix-Vector Multiplication Architecture
S. Kestur, John D. Davis, Eric S. Chung
{"title":"Towards a Universal FPGA Matrix-Vector Multiplication Architecture","authors":"S. Kestur, John D. Davis, Eric S. Chung","doi":"10.1109/FCCM.2012.12","DOIUrl":"https://doi.org/10.1109/FCCM.2012.12","url":null,"abstract":"We present the design and implementation of a universal, single-bit stream library for accelerating matrix-vector multiplication using FPGAs. Our library handles multiple matrix encodings ranging from dense to multiple sparse formats. A key novelty in our approach is the introduction of a hardware-optimized sparse matrix representation called Compressed Variable-Length Bit Vector (CVBV), which reduces the storage and bandwidth requirements up to 43% (on average 25%) compared to compressed sparse row (CSR) across all the matrices from the University of Florida Sparse Matrix Collection. Our hardware incorporates a runtime-programmable decoder that performs on-the-fly-decoding of various formats such as Dense, COO, CSR, DIA, and ELL. The flexibility and scalability of our design is demonstrated across two FPGA platforms: (1) the BEE3 (Virtex-5 LX155T with 16GB of DRAM) and (2) ML605 (Virtex-6 LX240T with 2GB of DRAM). For dense matrices, our approach scales to large data sets with over 1 billion elements, and achieves robust performance independent of the matrix aspect ratio. For sparse matrices, our approach using a compressed representation reduces the overall bandwidth while also achieving comparable efficiency relative to state-of-the-art approaches.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122349767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 77
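For reference, the sketch below shows a plain CSR sparse matrix-vector product, the baseline format the paper compares against. The CVBV encoding introduced in the paper further compresses the column-index stream; that encoding is not reproduced here, and the small example matrix is purely illustrative.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix A stored in compressed sparse row (CSR) form.

    values  : nonzero entries, stored row by row
    col_idx : column index of each nonzero
    row_ptr : row_ptr[r]..row_ptr[r+1] delimits row r's nonzeros
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for r in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y[r] = acc
    return y

if __name__ == "__main__":
    # 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    col_idx = [0, 2, 1, 0, 2]
    row_ptr = [0, 2, 3, 5]
    print(csr_spmv(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```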
A Low-Overhead Profiling and Visualization Framework for Hybrid Transactional Memory
Oriol Arcas, Philipp Kirchhofer, Nehir Sönmez, M. Schindewolf, O. Unsal, Wolfgang Karl, A. Cristal
{"title":"A Low-Overhead Profiling and Visualization Framework for Hybrid Transactional Memory","authors":"Oriol Arcas, Philipp Kirchhofer, Nehir Sönmez, M. Schindewolf, O. Unsal, Wolfgang Karl, A. Cristal","doi":"10.1109/FCCM.2012.11","DOIUrl":"https://doi.org/10.1109/FCCM.2012.11","url":null,"abstract":"Multi-core prototyping presents a good opportunity for establishing low overhead and detailed profiling and visualization in order to study new research topics. In this paper, we design and implement a low execution, low area overhead profiling mechanism and a visualization tool for observing Transactional Memory behaviors on FPGA. To achieve this, we non-disruptively create and bring out events on the fly and process them offline on a host. There, our tool regenerates the execution from the collected events and produces traces for comprehensively inspecting the behavior of interacting multithreaded programs. With zero execution overhead for hardware TM events, single-instruction overhead for software TM events, and utilizing a low logic area of 2.3% per processor core, we run TM benchmarks to evaluate various different levels of profiling detail with an average runtime overhead of 6%. We demonstrate the usefulness of such detailed examination of SW/HW transactional behavior in two parts: (i) we speed up a TM benchmark by 24.1%, and (ii) we closely inspect transactions to point out pathologies.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121931241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
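A toy illustration of the offline processing step described above: turning a stream of timestamped transactional events into per-transaction durations and per-core abort counts. The `(timestamp, core, tx_id, kind)` event layout is a hypothetical stand-in, not the event format the hardware actually emits.

```python
from collections import defaultdict

def summarize_tm_events(events):
    """events: iterable of (timestamp, core, tx_id, kind) tuples with kind in
    {'begin', 'commit', 'abort'}. Returns per-transaction commit durations
    and the number of aborts observed per core."""
    start = {}
    durations = {}
    aborts = defaultdict(int)
    for ts, core, tx, kind in sorted(events):
        if kind == "begin":
            start[(core, tx)] = ts
        elif kind == "commit":
            durations[(core, tx)] = ts - start.pop((core, tx))
        elif kind == "abort":
            aborts[core] += 1
            start.pop((core, tx), None)   # discard the aborted attempt's start time
    return durations, dict(aborts)

if __name__ == "__main__":
    ev = [(0, 0, 1, "begin"), (5, 0, 1, "abort"),
          (6, 0, 2, "begin"), (9, 0, 2, "commit")]
    print(summarize_tm_events(ev))
```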
Power Management Strategies for Serial RapidIO Endpoints in FPGAs
Moritz Schmid, Frank Hannig, J. Teich
{"title":"Power Management Strategies for Serial RapidIO Endpoints in FPGAs","authors":"Moritz Schmid, Frank Hannig, J. Teich","doi":"10.1109/FCCM.2012.26","DOIUrl":"https://doi.org/10.1109/FCCM.2012.26","url":null,"abstract":"We propose a novel data budget-based approach to dynamically control the average power consumption of Serial RapidIO endpoint controllers in FPGAs. The key concept of the approach is to not only perform clock-gating on the FPGA-internal components of the communication controller, but to disable the multi-gigabit transceivers during idle periods. The clock synchronization, inherent to serial interfaces, enables us to omit the often needed periodic link sensing, and only enable the controller according to a predefined schedule to transmit the allocated amount of data during a specific interval. Following this approach, we are able to reduce the dynamic power consumption by up to 77% on average.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"2011 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121543935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
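The data-budget idea lends itself to a simple back-of-the-envelope schedule: given an allocated number of bytes per interval and the link's line rate, compute how long the transceivers must stay enabled and gate them for the rest of the interval. The sketch below shows that calculation; the link rate, interval, and wake-up overhead are illustrative numbers, not figures from the paper.

```python
def active_window(bytes_per_interval, interval_s, link_bytes_per_s, wakeup_overhead_s=0.0):
    """Return (on_time_s, duty_cycle) needed to move the allocated data budget
    within one scheduling interval, including a fixed wake-up overhead."""
    on_time = bytes_per_interval / link_bytes_per_s + wakeup_overhead_s
    if on_time > interval_s:
        raise ValueError("data budget exceeds link capacity for this interval")
    return on_time, on_time / interval_s

if __name__ == "__main__":
    # e.g. 1 MiB every 10 ms over a ~1 GB/s serial link (illustrative numbers)
    on, duty = active_window(1 << 20, 0.010, 1_000_000_000, wakeup_overhead_s=100e-6)
    print(f"on-time = {on * 1e3:.2f} ms, duty cycle = {duty:.1%}")
```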
Memory Bandwidth Efficient Two-Dimensional Fast Fourier Transform Algorithm and Implementation for Large Problem Sizes
Berkin Akin, Peter Milder, F. Franchetti, J. Hoe
{"title":"Memory Bandwidth Efficient Two-Dimensional Fast Fourier Transform Algorithm and Implementation for Large Problem Sizes","authors":"Berkin Akin, Peter Milder, F. Franchetti, J. Hoe","doi":"10.1109/FCCM.2012.40","DOIUrl":"https://doi.org/10.1109/FCCM.2012.40","url":null,"abstract":"Prevailing VLSI trends point to a growing gap between the scaling of on-chip processing throughput and off-chip memory bandwidth. An efficient use of memory bandwidth must become a first-class design consideration in order to fully utilize the processing capability of highly concurrent processing platforms like FPGAs. In this paper, we present key aspects of this challenge in developing FPGA-based implementations of two-dimensional fast Fourier transform (2D-FFT) where the large datasets must reside off-chip in DRAM. Our scalable implementations address the memory bandwidth bottleneck through both (1) algorithm design to enable efficient DRAM access patterns and (2) data path design to extract the maximum compute throughput for a given level of memory bandwidth. We present results for double-precision 2D-FFT up to size 2,048-by-2,048. On an Alter a DE4 platform our implementation of the 2,048-by-2,048 2D-FFT can achieve over 19.2 Gflop/s from the 12 GByte/s maximum DRAM bandwidth available. The results also show that our FPGA-based implementations of 2D-FFT are more efficient than 2D-FFT running on state-of-the-art CPUs and GPUs in terms of the bandwidth and power efficiency.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116944174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 25
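The row-column decomposition underlying most 2D-FFT implementations is easy to state in software: apply 1D FFTs along the rows, then along the columns. The sketch below demonstrates only this algorithmic decomposition and checks it against a direct 2D FFT; it does not reflect the tiled DRAM access patterns or the datapath design the paper develops.

```python
import numpy as np

def fft2_row_column(a):
    """2D FFT via the row-column method: 1D FFTs over rows, then over columns."""
    step1 = np.fft.fft(a, axis=1)      # transform each row
    return np.fft.fft(step1, axis=0)   # then transform each column

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
    assert np.allclose(fft2_row_column(a), np.fft.fft2(a))
    print("row-column 2D FFT matches np.fft.fft2")
```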
RIFFA: A Reusable Integration Framework for FPGA Accelerators
Matthew Jacobsen, Y. Freund, R. Kastner
{"title":"RIFFA: A Reusable Integration Framework for FPGA Accelerators","authors":"Matthew Jacobsen, Y. Freund, R. Kastner","doi":"10.1109/FCCM.2012.44","DOIUrl":"https://doi.org/10.1109/FCCM.2012.44","url":null,"abstract":"We present RIFFA, a reusable integration framework for FPGA accelerators. RIFFA provides communication and synchronization for FPGA accelerated software using a standard interface. Our goal is to expand the use of FPGAs as an acceleration platform by releasing, as open source, a no cost framework that easily integrates software on traditional CPUs with FPGA based IP cores, over PCIe, with minimal custom configuration. RIFFA requires no specialized hardware or fee licensed IP cores. It can be deployed on common Linux workstations with a PCIe bus and has been tested on two different Linux distributions using Xilinx FPGAs.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125247001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 65
Area-Efficient Architectures for Large Integer and Quadruple Precision Floating Point Multipliers
M. Jaiswal, R. Cheung
{"title":"Area-Efficient Architectures for Large Integer and Quadruple Precision Floating Point Multipliers","authors":"M. Jaiswal, R. Cheung","doi":"10.1109/FCCM.2012.14","DOIUrl":"https://doi.org/10.1109/FCCM.2012.14","url":null,"abstract":"Large integer multiplication and floating point multiplication are the two dominating operations for many scientific and cryptographic applications. Large integer multipliers generally have linearly but high area requirement according to a given bit-width. High precision requirements of a given application lead to the use of quadruple precision arithmetic, however its operation is dominated by large integer multiplication of the mantissa product. In this paper, we propose a hardware efficient approach for implementing a fully pipelined large integer multipliers, and further extending it to Quadruple Precision (QP) floating point multiplication. The proposed design uses less hardware resources in terms of DSP48 blocks and slices, while attaining high performance. Promising results are obtained when compared our designs with the best reported large integer multipliers and also QP floating point multiplier in literatures. For instance, our results have demonstrated a significant improvement for the proposed QP multiplier, for over 50% improvement in terms of the DSP48 block usage with a penalty of slight additional slices, when compared to the best result in the literature on a Virtex-4 device.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117350746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
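The basic idea behind area-efficient wide multipliers, building a large product out of many narrow ones, can be shown with a schoolbook limb decomposition in which each limb-by-limb product corresponds to the kind of operation a DSP48 block provides. The limb width and the 113-bit operand size (the precision of a quadruple-precision significand) below are illustrative assumptions, and the sketch does not reproduce the specific decomposition or pipelining of the paper's design.

```python
def wide_mul(a, b, limb_bits=17):
    """Multiply two arbitrary-size non-negative integers by splitting them into
    limbs and accumulating shifted limb-by-limb partial products."""
    mask = (1 << limb_bits) - 1

    def limbs(x):
        out = []
        while x:
            out.append(x & mask)
            x >>= limb_bits
        return out or [0]

    la, lb = limbs(a), limbs(b)
    result = 0
    for i, ai in enumerate(la):
        for j, bj in enumerate(lb):
            # One narrow ("DSP-sized") product, shifted into place and accumulated.
            result += (ai * bj) << (limb_bits * (i + j))
    return result

if __name__ == "__main__":
    import random
    x, y = random.getrandbits(113), random.getrandbits(113)  # QP significand width
    assert wide_mul(x, y) == x * y
    print("wide_mul matches Python's built-in big-integer multiply")
```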
Exploiting Modified Placement and Hardwired Resources to Provide High Reliability in FPGAs
G. Nazar, L. Carro
{"title":"Exploiting Modified Placement and Hardwired Resources to Provide High Reliability in FPGAs","authors":"G. Nazar, L. Carro","doi":"10.1109/FCCM.2012.56","DOIUrl":"https://doi.org/10.1109/FCCM.2012.56","url":null,"abstract":"Possible scenarios for future manufacturing technologies increase the desirable features of fault tolerance techniques, such as coping with multiple faults and reducing error latency. On the other hand, current high-end FPGAs present, besides lookup tables and flip-flops, several dedicated components that perform the most commonly required functions. In this paper, we propose an approach to use such resources to efficiently provide fault detection capabilities. We further extend the technique with placement constraints to enhance the detection of faults affecting the routing resources, which is a critical demand for such devices.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116119567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13