Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays: latest publications

Modular multi-ported SRAM-based memories
Ameer Abdelhadi, G. Lemieux
DOI: 10.1145/2554688.2554773
Abstract: Multi-ported RAMs are essential for high-performance parallel computation systems. VLIW and vector processors, CGRAs, DSPs, CMPs, and other processing systems often rely on multi-ported memories for parallel access, and hence higher performance. Although memories with a large number of read and write ports are important, their high implementation cost means they are used sparingly in designs. As a result, FPGA vendors provide only dual-ported block RAMs to handle the majority of usage patterns. In this paper, a novel and modular approach is proposed to construct multi-ported memories out of basic dual-ported RAM blocks. Like other multi-ported RAM designs, each write port uses a different RAM bank and each read port uses bank replication. The main contribution of this work is an optimization that merges the previous live-value-table (LVT) and XOR approaches into a common design that uses a generalized, simpler structure we call an invalidation-based live-value-table (I-LVT). Like a regular LVT, the I-LVT determines the correct bank to read from, but it differs in how updates to the table are made: the LVT approach requires multiple write ports, often leading to an area-intensive register-based implementation, while the XOR approach uses wider memories to accommodate the XOR-ed data and suffers from lower clock speeds. Two specific I-LVT implementations, binary and one-hot coded, are proposed and evaluated. The I-LVT approach is especially suitable for larger multi-ported RAMs because the table is implemented only in SRAM cells. The I-LVT method gives higher performance while occupying fewer block RAMs than earlier approaches: for several configurations, the suggested method reduces block RAM usage by over 44% and improves clock speed by over 76%. To assist others, we are releasing our fully parameterized Verilog implementation as an open-source hardware library. The library has been extensively tested using ModelSim and Altera's Quartus tools.
Published: 2014-02-26. Citations: 26.
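The XOR banking idea that the I-LVT generalizes can be illustrated with a small behavioral model. The Python sketch below is hypothetical (class and method names are invented, and the per-write-port bank reads that hardware obtains via replication are modeled as plain loops); the paper's actual artifact is parameterized Verilog.

```python
# Toy model of the XOR approach to multi-write-port RAM: one bank per write
# port, and a read returns the XOR of all banks at that address. Timing and
# bank replication for reads are deliberately ignored.
class XorMultiWriteRAM:
    def __init__(self, num_write_ports, depth):
        self.banks = [[0] * depth for _ in range(num_write_ports)]

    def write(self, port, addr, value):
        # Store value XOR (contents of every *other* bank at addr), so that
        # XOR-ing across all banks at addr reconstructs the newest value.
        others = 0
        for p, bank in enumerate(self.banks):
            if p != port:
                others ^= bank[addr]
        self.banks[port][addr] = value ^ others

    def read(self, addr):
        result = 0
        for bank in self.banks:
            result ^= bank[addr]
        return result
```

Note the cost the abstract alludes to: every write must first read all other banks, which in hardware widens the memories and lengthens the write path, motivating the I-LVT alternative.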
Non-adaptive sparse recovery and fault evasion using disjunct design configurations (abstract only)
Ahmad Alzahrani, R. Demara
DOI: 10.1145/2554688.2554758
Abstract: A run-time fault diagnosis and evasion scheme for reconfigurable devices is developed based on explicit Non-adaptive Group Testing (NGT). NGT groups disjunct subsets of reconfigurable resources into test pools, or samples. Each test pool realizes a Diagnostic Configuration (DC) that performs functional testing during the diagnosis procedure. The collective outcomes after testing each diagnostic pool can be efficiently decoded to identify up to d defective logic resources. An algorithm that constructs the NGT sampling procedure and resource placement at design time with a minimal number of test groups is derived from the d-disjunctness property, well known in the statistical literature. The combinatorial properties of the resulting DCs also guarantee that any possible set of at most d defective resources is not utilized by at least one DC, allowing low-overhead fault resolution, and they provide the ability to assess the failure state of resources. The proposed testing scheme thus avoids the time-intensive run-time diagnosis imposed by previously proposed adaptive group testing for reconfigurable hardware, without compromising diagnostic coverage. In addition, the proposed NGT scheme can be combined with other fault-tolerance approaches to improve their fault-recovery strategies. Experimental results for a set of MCNC benchmarks using the Xilinx ISE Design Suite on a Virtex-5 FPGA demonstrate d-diagnosability at the slice level with average accuracies of 99.15% and 97.76% for d=1 and d=2, respectively.
Published: 2014-02-26. Citations: 5.
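The pool-and-decode idea behind the Diagnostic Configurations is standard non-adaptive group testing, and a minimal sketch makes it concrete. The pools below are a made-up 1-disjunct design over six resources, not the paper's construction:

```python
# Non-adaptive group testing decoder: an item is declared defective iff
# every pool that contains it tests positive. With a d-disjunct pool design
# this identifies up to d defectives exactly.
def decode(pools, outcomes):
    items = set().union(*pools)
    suspects = set()
    for item in items:
        if all(outcomes[i] for i, pool in enumerate(pools) if item in pool):
            suspects.add(item)
    return suspects

# Hypothetical 1-disjunct design: each item's membership pattern is distinct
# and no pattern is contained in another, so one defective is always isolated.
pools = [{0, 1, 2}, {0, 3, 4}, {1, 3, 5}, {2, 4, 5}]
defective = {3}
# A pool tests positive exactly when it contains a defective resource.
outcomes = [bool(pool & defective) for pool in pools]
```

Decoding `outcomes` recovers `{3}` without any adaptive re-testing, which is the run-time advantage the abstract claims over adaptive schemes.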
Novel FPGA clock network with low latency and skew (abstract only)
Lei Li, Jian Wang, Jinmei Lai
DOI: 10.1145/2554688.2554722
Abstract: A clock network is a dedicated network for distributing multiple clock signals to every logic module in a system. Significantly different from ASICs, where the clock tree is custom built by the designer, the clock network in an FPGA is usually fixed after chip fabrication and cannot be changed for different user circuits. This paper designs and implements an FPGA clock network with low latency and skew. We first propose a novel clock network for FPGAs with a backbone-branches topology that can be easily integrated into a tiled FPGA at reasonable area cost. The network has one clock backbone and several primary clock branches, and it can be extended easily as the chip scales up. A series of strategies, such as hybrid multiplexers, bypassing, looping back, and a Programmable Delay Adjustment Unit (DAU), is then employed to optimize latency and skew. Moreover, the prominent coupling-capacitance and crosstalk effects of nanometer-scale clock routing are also considered in the physical implementation. The clock network is applied to our own FPGA design in 65nm technology. Post-layout simulation results indicate that the clock network with normal loads can sustain a 600MHz clock with maximum clock latency and skew of typically 2.22ns and 40ps respectively (1.79ns and 39ps in the fast corner), achieving up to 78.2% improvement in skew and 47.5% in latency compared to a commercial 65nm FPGA device.
Published: 2014-02-26. Citations: 0.
Towards interconnect-adaptive packing for FPGAs
J. Luu, Jonathan Rose, J. Anderson
DOI: 10.1145/2554688.2554783
Abstract: In order to investigate new FPGA logic blocks, FPGA architects have traditionally needed to customize CAD tools to make use of the new features and characteristics of those blocks. The software development effort needed to create such CAD tools can be time-consuming and can significantly limit the number and variety of architectures explored. Architects therefore want flexible CAD tools that can explore a diverse space with few or no software modifications. Existing flexible CAD tools suffer from impractically long runtimes and/or fail to make efficient use of the important new features of the logic blocks being investigated. This work is a step towards addressing these concerns by enhancing the packing stage of the open-source VTR CAD flow [17] to efficiently handle common interconnect structures used to create many kinds of useful novel blocks, including crossbars, carry chains, and dedicated signals. To accomplish this, we employ three techniques: speculative packing, pre-packing, and interconnect-aware pin counting. We show that these techniques, along with three minor modifications, improve runtime and quality of results across a spectrum of architectures while simultaneously expanding the scope of architectures that can be explored. Compared with VTR 1.0 [17], we show an average 12-fold speedup in packing for fracturable LUT architectures, with 20% lower minimum channel width and 6% lower critical path delay, and a 6- to 7-fold speedup for architectures with non-fracturable LUTs and architectures with depopulated crossbars. In addition, we demonstrate packing support for logic blocks with carry chains.
Published: 2014-02-26. Citations: 21.
Optimizing effective interconnect capacitance for FPGA power reduction
Safeen Huda, J. Anderson, H. Tamura
DOI: 10.1145/2554688.2554788
Abstract: We propose a technique to reduce the effective parasitic capacitance of interconnect routing conductors, simultaneously reducing power consumption and improving delay. The reduction is achieved by ensuring that routing conductors adjacent to those used by timing-critical or high-activity nets are left floating, disconnected from either VDD or GND. The effective coupling capacitance between the conductors is thereby reduced, because the original coupling capacitance is placed in series with other capacitances in the circuit (series combinations of capacitors have lower effective capacitance). Allowing unused conductors to float requires tri-state routing buffers, and to that end we also propose low-cost tri-state buffer circuitry. We further introduce CAD techniques that maximize the likelihood that unused routing conductors are placed adjacent to those used by nets with high activity or low slack, improving both power and speed. Results show that interconnect dynamic power reductions of up to ~15.5% can be expected, with a critical path degradation of ~1% and a total area overhead of ~2.1%.
Published: 2014-02-26. Citations: 10.
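The series-capacitance argument in the abstract can be checked with back-of-the-envelope arithmetic. The numbers below are illustrative, not taken from the paper:

```python
# Effective load seen by an active net through its coupling capacitance,
# for a driven neighbor versus a floating one.
def series(c1, c2):
    # Effective capacitance of two capacitors in series: C1*C2 / (C1 + C2).
    return c1 * c2 / (c1 + c2)

c_couple = 2.0   # fF, coupling cap to the adjacent routing conductor (made-up)
c_ground = 3.0   # fF, the floating neighbor's remaining cap to GND/VDD (made-up)

# Driven neighbor: the full coupling cap loads the active net.
c_driven = c_couple
# Floating neighbor: the coupling cap is in series with the neighbor's other
# capacitance, so the active net sees a smaller effective load (1.2 fF here).
c_floating = series(c_couple, c_ground)
```

Since `series(c1, c2)` is always below `min(c1, c2)`, floating the neighbor strictly reduces the effective coupling load, which is exactly the mechanism the abstract describes.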
Accelerating massive short reads mapping for next generation sequencing (abstract only)
Chunming Zhang, Wen Tang, Guangming Tan
DOI: 10.1145/2554688.2554707
Abstract: With gene sequencing runs now producing over one billion reads each, the data-intensive computations of Next Generation Sequencing (NGS) applications pose great challenges to current computing capability. In this paper we investigate algorithmic and architectural acceleration strategies for a typical NGS analysis algorithm, short reads mapping, on a commodity multicore processor and on a customizable FPGA coprocessor architecture, respectively. First, we propose a hash-bucket reordering algorithm that increases shared-cache parallelism while searching the hash index. By exploiting shared-cache parallelism, this algorithmic strategy achieves 122Gbp/day throughput, a 2x performance improvement on an 8-core Intel Xeon processor. Second, we develop an FPGA coprocessor that exploits both bit-level and word-level parallelism with a scatter-gather memory mechanism, speeding up the inherently irregular memory access operations by increasing effective memory bandwidth. Our customized FPGA coprocessor achieves 947Gbp/day throughput, 189 times higher than current mapping tools on a single CPU core and over 2 times higher than a 64-core multiprocessor system, with 29 times higher power efficiency than the conventional 64-core multiprocessor. The results indicate that a customized FPGA coprocessor architecture configured with word-level scatter-gather memory access is attractive for data-intensive applications.
Published: 2014-02-26. Citations: 0.
Big data genome sequencing on Zynq based clusters (abstract only)
Chao Wang, Xi Li, Xuehai Zhou, Yunji Chen, R. Cheung
DOI: 10.1145/2554688.2554694
Abstract: Next-generation sequencing (NGS) problems have attracted much attention from researchers in the biological and medical computing domains. State-of-the-art NGS computing machines are dramatically lowering the cost and increasing the throughput of DNA sequencing. In this paper, we present a practical study that uses the Xilinx Zynq board to build acceleration engines from FPGA accelerators and ARM processors for state-of-the-art short read mapping approaches. The heterogeneous processors and accelerators are coupled through a general Hadoop distributed processing framework: reads are first collected by the central server and then distributed to multiple accelerators on the Zynq boards for hardware acceleration. The combination of hardware acceleration and the MapReduce execution flow thus greatly accelerates the task of aligning short reads to a known reference genome. Our approach is based on preprocessing the reference genome and running iterative jobs that align the continuously incoming reads; the hardware acceleration is based on the well-established RMAP read-mapping algorithm. Furthermore, we evaluate the speedup on a Hadoop cluster comprising 8 development boards. Experimental results demonstrate that the proposed architecture and methods achieve a speedup of more than 112x and scale with the number of accelerators. Finally, the Zynq-based cluster has the potential to efficiently accelerate even general large-scale big data applications. This work was supported by NSFC grants No. 61379040, No. 61272131 and No. 61202053.
Published: 2014-02-26. Citations: 8.
OmpSs@Zynq all-programmable SoC ecosystem
Antonio Filgueras, E. Gil, Daniel Jiménez-González, C. Álvarez, X. Martorell, Jan Langer, Juanjo Noguera, K. Vissers
DOI: 10.1145/2554688.2554777
Abstract: OmpSs is an OpenMP-like, directive-based programming model that supports heterogeneous execution (MIC, GPU, SMP, etc.) and runtime management of task dependencies; indeed, OmpSs has largely influenced the recently released OpenMP 4.0 specification. The Zynq All-Programmable SoC combines the features of an SMP and an FPGA, benefiting from DLP, ILP and TLP parallelism to efficiently exploit new technology improvements and chip resource capacities. In this paper, we focus on programmability and heterogeneous execution support, presenting a successful combination of the OmpSs programming model and the Zynq All-Programmable SoC platforms.
Published: 2014-02-26. Citations: 28.
Theory and algorithm for generalized memory partitioning in high-level synthesis
Yuxin Wang, Peng Li, J. Cong
DOI: 10.1145/2554688.2554780
Abstract: The significant development of high-level synthesis tools has greatly advanced FPGAs as general computing platforms. When parallelizing the data path, memory becomes a crucial bottleneck that impedes performance enhancement: simultaneous data access is highly restricted by the data mapping strategy and memory port constraints. Memory partitioning can efficiently map data elements of the same logical array onto multiple physical banks so that accesses to the array are parallelized. Previous memory partitioning methods focused mainly on cyclic partitioning for single-port memory. In this work we propose a generalized memory partitioning framework to provide high data throughput from on-chip memories. We generalize cyclic partitioning into block-cyclic partitioning for a larger design-space exploration, build the conflict detection algorithm on polytope emptiness testing, and use integer-point counting in polytopes for intra-bank offset generation. The framework also supports partitioning for multi-port memory. Experimental results demonstrate that, compared to the state-of-the-art partitioning algorithm, our proposed algorithm reduces block RAM usage by 19.58%, slices by 20.26% and DSPs by 50%.
Published: 2014-02-26. Citations: 73.
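The block-cyclic mapping the abstract generalizes from cyclic partitioning can be sketched in a few lines. This is the standard scheme only, with illustrative parameter names; the paper's contribution is choosing the parameters and proving access conflicts empty via polyhedral analysis:

```python
# Block-cyclic partitioning: map a flat array index to (bank, offset).
# block_size = 1 degenerates to plain cyclic partitioning.
def block_cyclic(addr, block_size, num_banks):
    block = addr // block_size          # which block the index falls in
    bank = block % num_banks            # blocks are dealt to banks round-robin
    # Offset: full rounds of num_banks blocks already stored in this bank,
    # times block_size, plus the position inside the current block.
    offset = (block // num_banks) * block_size + addr % block_size
    return bank, offset

# With 2 banks and block size 2, indices 0..7 land in banks 0,0,1,1,0,0,1,1.
```

Two accesses conflict exactly when they hit the same bank in the same cycle, which is why the framework tests emptiness of the polytope of index pairs with equal `bank` values.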
Quantifying the cost and benefit of latency insensitive communication on FPGAs
Kevin E. Murray, Vaughn Betz
DOI: 10.1145/2554688.2554786
Abstract: Latency insensitive communication offers many potential benefits for FPGA designs, including easier timing closure through automatic pipelining and easier interfacing with embedded NoCs. However, it is important to understand the costs and trade-offs associated with any new design style. This paper presents optimized implementations of latency insensitive communication building blocks, quantifies their area and frequency overheads, and provides guidance to designers on how to build high-speed, area-efficient latency insensitive systems.
Published: 2014-02-26. Citations: 12.
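The elementary latency insensitive building block is a register stage with a valid/ready handshake: data advances only when the producer has it and the consumer can take it, so extra pipeline latency never corrupts the stream. A toy Python model (invented names, one-entry capacity, not the paper's optimized circuits):

```python
# One stage of a latency insensitive pipeline. It holds at most one item;
# downstream backpressure stalls the producer instead of dropping data.
class HandshakeStage:
    def __init__(self):
        self.data = None

    def ready(self):
        # Upstream may push only when the stage is empty.
        return self.data is None

    def push(self, value):
        # Models the cycle where upstream valid and this stage's ready meet.
        assert self.ready(), "push while full would lose data"
        self.data = value

    def pop(self):
        # Models the cycle where downstream ready and this stage's valid meet.
        value, self.data = self.data, None
        return value
```

Chaining such stages is what enables the automatic pipelining the abstract mentions; the paper's point is that each stage has measurable area and frequency cost, so the depth and buffer sizing are a trade-off rather than free.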