2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines最新文献

筛选
英文 中文
Automating Optimization of Reconfigurable Designs 可重构设计的自动化优化
Maciej Kurek, Tobias Becker, T. Chau, W. Luk
{"title":"Automating Optimization of Reconfigurable Designs","authors":"Maciej Kurek, Tobias Becker, T. Chau, W. Luk","doi":"10.1109/FCCM.2014.65","DOIUrl":"https://doi.org/10.1109/FCCM.2014.65","url":null,"abstract":"We present Automatic Reconfigurable Design Efficient Global Optimization (ARDEGO), a new algorithm based on the existing Efficient Global Optimization (EGO) methodology for automating optimization of reconfigurable designs targeting Field-Programmable Gate Array (FPGA) technology. It is a potentially disruptive design approach: instead of manually improving designs repeatedly but without understanding the design space as a whole, ARDEGO users follow a novel approach that: (a) automates the manual optimization process, significantly reducing optimization time and (b) does not require the user to calibrate or understand the inner workings of the algorithm. We evaluate ARDEGO using two case studies: financial option pricing and seismic imaging.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133995576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 17
Customizable Compression Architecture for Efficient Configuration in CGRAs 可定制的压缩体系结构在CGRAs中的高效配置
Syed M. A. H. Jafri, Muhammad Adeel Tajammul, M. Daneshtalab, A. Hemani, K. Paul, P. Ellervee, J. Plosila, H. Tenhunen
{"title":"Customizable Compression Architecture for Efficient Configuration in CGRAs","authors":"Syed M. A. H. Jafri, Muhammad Adeel Tajammul, M. Daneshtalab, A. Hemani, K. Paul, P. Ellervee, J. Plosila, H. Tenhunen","doi":"10.1109/FCCM.2014.18","DOIUrl":"https://doi.org/10.1109/FCCM.2014.18","url":null,"abstract":"Today, Coarse Grained Reconfigurable Architectures (CGRAs) host multiple applications. Novel CGRAs allow each application to exploit runtime parallelism and time sharing. Although these features enhance the power and silicon efficiency, they significantly increase the configuration memory overheads. As a solution to this problem researchers have employed statistical compression, intermediate compact representation, and multicasting. Each of these techniques has different properties, and is therefore best suited for a particular class of applications. However, existing research only deals with these methods separately. In this paper we propose a morphable compression architecture that interleaves these techniques in a unique platform.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114262083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Automated Partial Reconfiguration Design for Adaptive Systems with CoPR for Zynq 基于Zynq的CoPR自适应系统的自动部分重构设计
Kizheppatt Vipin, Suhaib A. Fahmy
{"title":"Automated Partial Reconfiguration Design for Adaptive Systems with CoPR for Zynq","authors":"Kizheppatt Vipin, Suhaib A. Fahmy","doi":"10.1109/FCCM.2014.63","DOIUrl":"https://doi.org/10.1109/FCCM.2014.63","url":null,"abstract":"Dynamically adaptive systems (DAS) respond to environmental conditions, by modifying their processing at runtime and selecting alternative configurations of computation. Field programmable gate arrays, with their support for partial reconfiguration (PR) represent an ideal platform for implementing such systems. Designing partially reconfigurable systems has traditionally been a difficult task requiring FPGA expertise. This paper presents a fully automated framework for implementing PR based adaptive systems. The designer specifies a set of valid configurations containing instances of modules from a standard library. The tool automates partitioning of modules into regions, floorplanning regions on the FPGA fabric, and generation of bitstreams. A runtime system manages the loading of bitstreams automatically through API calls.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124177225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
Fast and Power Efficient Heapsort IP for Image Compression Application 快速和高效的堆排序IP图像压缩应用程序
Yuhui Bai, S. Z. Ahmed, B. Granado
{"title":"Fast and Power Efficient Heapsort IP for Image Compression Application","authors":"Yuhui Bai, S. Z. Ahmed, B. Granado","doi":"10.1109/FCCM.2014.72","DOIUrl":"https://doi.org/10.1109/FCCM.2014.72","url":null,"abstract":"We present a hardware architecture of a heapsort algorithm, the sorting is employed in a subband coding block of a wavelet-based image coder termed Öktem image coder [1]. Although this coder provides good image quality, the sorting is time consuming, and is application specific, as the sorting is repetitively used for different volume of data in the subband coding, thus a simple hardware implementation with fixed sorting capacity will be difficult to scale during runtime. To tackle this problem, the time/power efficiency and the sorting size flexibility have to be taken in to account. We proposed an improved FPGA heapsort architecture based on Zabołotny's work [2] as an IP accelerator of the image coder. We present a configurable architecture by using adaptive layer enable elements so the sorting capacity could be adjusted during runtime to efficiently sort different amount of data. With the adaptive memory shutdown, our improved architecture provides up to 20.9% power reduction on the memories compared to the baseline implementation. Moreover, our architecture provides 13x speedup compared to ARM CortexA9.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120928705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
A Scalable Multi-engine Xpress9 Compressor with Asynchronous Data Transfer 具有异步数据传输的可扩展多引擎Xpress9压缩器
Joo-Young Kim, S. Hauck, D. Burger
{"title":"A Scalable Multi-engine Xpress9 Compressor with Asynchronous Data Transfer","authors":"Joo-Young Kim, S. Hauck, D. Burger","doi":"10.1109/FCCM.2014.49","DOIUrl":"https://doi.org/10.1109/FCCM.2014.49","url":null,"abstract":"Data compression is crucial in large-scale storage servers to save both storage and network bandwidth, but it suffers from high computational cost. In this work, we present a high throughput FPGA based compressor as a PCIe accelerator to achieve CPU resource saving and high power efficiency. The proposed compressor is differentiated from previous hardware compressors by the following features: 1) targeting Xpress9 algorithm, whose compression quality is comparable to the best Gzip implementation (level 9); 2) a scalable multi-engine architecture with various IP blocks to handle algorithmic complexity as well as to achieve high throughput; 3) supporting a heavily multi-threaded server environment with an asynchronous data transfer interface between the host and the accelerator. The implemented Xpress9 compressor on Altera Stratix V GS performs 1.6-2.4Gbps throughput with 7 engines on various compression benchmarks, supporting up to 128 thread contexts.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128495382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
Integrated CUDA-to-FPGA Synthesis with Network-on-Chip 集成CUDA-to-FPGA合成与片上网络
S. Gurumani, Jacob Tolar, Yao Chen, Yun Liang, K. Rupnow, Deming Chen
{"title":"Integrated CUDA-to-FPGA Synthesis with Network-on-Chip","authors":"S. Gurumani, Jacob Tolar, Yao Chen, Yun Liang, K. Rupnow, Deming Chen","doi":"10.1109/FCCM.2014.14","DOIUrl":"https://doi.org/10.1109/FCCM.2014.14","url":null,"abstract":"Data parallel languages such as CUDA and Open CL efficiently describe many parallel threads of computation, and HLS tools can effectively translate these descriptions into independent optimized cores. As the number of instantiated cores grows, average external memory access latency can be a significant factor in system performance. However, although each core produces outputs independently, the cores often heavily share input data. Exploiting on-chip data sharing both reduces external bandwidth demand and improves the average memory access latency, allowing the system to improve performance at the same number of cores. In this paper, we develop a network-on-chip coupled with computation cores synthesized from CUDA for FPGAs that enables on-chip data sharing. We demonstrate reduced external bandwidth demand by up to 60% (average 56%) and total application latency in cycles by up to 43% (average 27%).","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127237384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
A Self-Adaptive SEU Mitigation System for FPGAs with an Internal Block RAM Radiation Particle Sensor 内置块RAM辐射粒子传感器的fpga自适应SEU缓解系统
R. Glein, Bernhard Schmidt, F. Rittner, J. Teich, Daniel Ziener
{"title":"A Self-Adaptive SEU Mitigation System for FPGAs with an Internal Block RAM Radiation Particle Sensor","authors":"R. Glein, Bernhard Schmidt, F. Rittner, J. Teich, Daniel Ziener","doi":"10.1109/FCCM.2014.79","DOIUrl":"https://doi.org/10.1109/FCCM.2014.79","url":null,"abstract":"In this paper, we propose a self-adaptive FPGA-based, partially reconfigurable system for space missions in order to mitigate Single Event Upsets in the FPGA configuration and fabric. Dynamic reconfiguration is used here for an on-demand replication of modules in dependence of current and changing radiation levels. More precisely, the idea is to trigger a redundancy scheme such as Dual Modular Redundancy or Triple Modular Redundancy in response to a continuously monitored Single Event Upset rate measured inside the on-chip memories itself, e.g., any subset (even used) internal Block RAMs. Depending on the current radiation level, the minimal number of replicas is determined at runtime under the constraint that a required Safety Integrity Level for a module is ensured and configured accordingly. For signal processing applications it is shown that this autonomous adaption to the different solar conditions realizes a resource efficient mitigation. In our case study, we show that it is possible to triplicate the data throughput at the Solar Maximum condition (no flares) compared to a Triple Modular Redundancy implementation of a single module. We also show the decreasing Probability of Failures Per Hour by 2 × 104 at flare-enhanced conditions compared with a non-redundant system.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124882374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 22
Speeding Up FPGA Placement: Parallel Algorithms and Methods 加速FPGA布局:并行算法和方法
Ma An, J. Gregory Steffan, Vaughn Betz
{"title":"Speeding Up FPGA Placement: Parallel Algorithms and Methods","authors":"Ma An, J. Gregory Steffan, Vaughn Betz","doi":"10.1109/FCCM.2014.60","DOIUrl":"https://doi.org/10.1109/FCCM.2014.60","url":null,"abstract":"Placement of a large FPGA design now commonly requires several hours, significantly hindering designer productivity. Furthermore, FPGA capacity is growing faster than CPU speed, which will further increase placement time unless new approaches are found. Multi-core processors are now ubiquitous, however, and some recent processors also have hardware support for transactional memory (TM), making parallelism an increasingly attractive approach for speeding up placement. We investigate methods to parallelize the simulated annealing placement algorithm in VPR, which is widely used in FPGA research. We explore both algorithmic changes and the use of different parallel programming paradigms and hardware, including TM, thread-level speculation (TLS) and lock-free techniques. We find that hardware TM enables large speedups (8.1x on average), but compromises “move fairness” and leads to an unacceptable quality loss. TLS scales poorly, with a maximum 2.2x speedup, but preserves quality. A new dependency checking parallel strategy achieves the best balance: the deterministic version achieves 5.9x speedup and no quality loss, while the non-deterministic, lock-free version can scale to a 34x speedup.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121991788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 22
FPGAs in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack 云中的fpga:使用OpenStack启动虚拟化硬件加速器
Stuart Byma, J. Steffan, H. Bannazadeh, Alberto Leon Garcia, P. Chow
{"title":"FPGAs in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack","authors":"Stuart Byma, J. Steffan, H. Bannazadeh, Alberto Leon Garcia, P. Chow","doi":"10.1109/FCCM.2014.42","DOIUrl":"https://doi.org/10.1109/FCCM.2014.42","url":null,"abstract":"We present a new approach for integrating virtualized FPGA-based hardware accelerators into commercial-scale cloud computing systems, with minimal virtualization overhead. Partially reconfigurable regions across multiple FPGAs are offered as generic cloud resources through OpenStack (opensource cloud software), thereby allowing users to “boot” custom designed or predefined network-connected hardware accelerators with the same commands they would use to boot a regular Virtual Machine. We propose a hardware and software framework to enable this virtualization. This is a first attempt at closely fitting FPGAs into existing cloud computing models, where resources are virtualized, flexible, and have the illusion of infinite scalability. Our system can set up and tear down virtual accelerators in approximately 2.6 seconds on average, much faster than regular virtual machines. The static virtualization hardware on the physical FPGAs causes only a three cycle latency increase and a one cycle pipeline stall per packet in accelerators when compared to a non-virtualized system. We present a case study analyzing the design and performance of an application-level load balancer using a fully implemented prototype of our system. Our study shows that FPGA cloud compute resources can easily outperform virtual machines, while the system's virtualization and abstraction significantly reduces design iteration time and design complexity.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116311518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 229
Look-up Table Design for Deep Sub-threshold through Full-Supply Operation 全供深次阈值查表设计
M. Abusultan, S. Khatri
{"title":"Look-up Table Design for Deep Sub-threshold through Full-Supply Operation","authors":"M. Abusultan, S. Khatri","doi":"10.1109/FCCM.2014.80","DOIUrl":"https://doi.org/10.1109/FCCM.2014.80","url":null,"abstract":"Field programmable gate arrays (FPGAs) are the implementation platform of choice when it comes to design flexibility. However, the high power consumption of FPGAs (which arises due to their flexible structure), make them less appealing for extreme low power applications. In this paper, we present a design of an FPGA lookup table (LUT), with the goal of seamless operation over a wide band of supply voltages. The same LUT design has the ability to operate at sub-threshold voltage when low power is required, and at higher voltages whenever faster performance is required. The results show that operating the LUT in sub-threshold mode yields a (~80×) lower power and a (~4×) lower energy than full supply voltage operation, for a 6-input LUT implemented in a 22nm predictive technology. The key drawback of sub-threshold operation is its susceptibility to process, temperature, and supply voltage (PVT) variations. This paper also presents the design and experimental results for a closed-loop adaptive body biasing mechanism to dynamically cancel global (spacial) as well as local (random) PVT variations. For the same 22nm technology, we demonstrate that the closed-loop adaptive body biasing circuits can allow the FPGA LUT to operate over an operating frequency range that spans an order of magnitude (40 MHz to 1300 MHz). We also show that the closed-loop adaptive body biasing circuits can cancel delay variations due to supply voltage changes, and reduce the effect of process variations on setup and hold times by 1.8× and 2.9× respectively. The dynamic body biasing circuits incur a 3.49% area overhead when designed to each drive a cluster of 25 LUTs.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116788610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信