2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)最新文献

筛选
英文 中文
A LUT-Based Approximate Adder 基于lut的近似加法器
Andreas Becher, Jorge Echavarria, Daniel Ziener, S. Wildermann, J. Teich
{"title":"A LUT-Based Approximate Adder","authors":"Andreas Becher, Jorge Echavarria, Daniel Ziener, S. Wildermann, J. Teich","doi":"10.1109/FCCM.2016.16","DOIUrl":"https://doi.org/10.1109/FCCM.2016.16","url":null,"abstract":"In this paper, we propose a novel approximate adder structure for LUT-based FPGA technology. Compared with a full featured accurate carry-ripple adder, the longest path is significantly shortened which enables the clocking with an increased clock frequency. By using the proposed adder structure, the throughput of an FPGA-based implementation can be significantly increased. On the other hand, the resulting average error can be reduced compared to similar approaches for ASIC implementations.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123039851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
Evaluating Embedded FPGA Accelerators for Deep Learning Applications 评估用于深度学习应用的嵌入式FPGA加速器
Gopalakrishna Hegde, Siddhartha, Nachiappan Ramasamy, Vamsi Buddha, Nachiket Kapre
{"title":"Evaluating Embedded FPGA Accelerators for Deep Learning Applications","authors":"Gopalakrishna Hegde, Siddhartha, Nachiappan Ramasamy, Vamsi Buddha, Nachiket Kapre","doi":"10.1109/FCCM.2016.14","DOIUrl":"https://doi.org/10.1109/FCCM.2016.14","url":null,"abstract":"FPGA-based embedded soft vector processors can exceed the performance and energy-efficiency of embedded GPUs and DSPs for lightweight deep learning applications. For low complexity deep neural networks targeting resource constrained platforms, we develop optimized Caffe-compatible deep learning library routines that target a range of embedded accelerator-based systems between 4 -- 8 W power budgets such as the Xilinx Zedboard (with MXP soft vector processor), NVIDIA Jetson TK1 (GPU), InForce 6410 (DSP), TI EVM5432 (DSP) as well as the Adapteva Parallella board (custom multi-core with NoC). For MNIST (28×28 images) and CIFAR10 (32×32 images), the deep layer structure is amenable to MXP-enhanced FPGA mappings to deliver 1.4 -- 5× higher energy efficiency than all other platforms. Not surprisingly, embedded GPU works better for complex networks with large image resolutions.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125640498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
Spatial Predicates Evaluation in the Geohash Domain Using Reconfigurable Hardware 基于可重构硬件的Geohash域空间谓词评估
Dajung Lee, R. Moussalli, S. Asaad, M. Srivatsa
{"title":"Spatial Predicates Evaluation in the Geohash Domain Using Reconfigurable Hardware","authors":"Dajung Lee, R. Moussalli, S. Asaad, M. Srivatsa","doi":"10.1109/FCCM.2016.51","DOIUrl":"https://doi.org/10.1109/FCCM.2016.51","url":null,"abstract":"As location sensing devices are becoming ubiquitous, overwhelming amounts of data are being produced by the Internet-of-Things-That-Move. Though analyzing this data presents significant business opportunities, new techniques are needed to attain adequate levels of processing performance. One example is the recently introduced geohash geographical coordinate system that is mainly used for indexing. While geohash codes provide useful inherent properties such as hierarchical and variable-precision coding, traditional spatial algorithms operate on data represented using the conventional latitude/longitude geographical coordinate system, and as such do not take advantage of geohash coding. This paper tackles the evaluation of spatial predicates on geometries defined in the geohash domain, as an alternative to the standard Dimensionally Extended Nine-Intersection Model (DE-9IM). We present the first hardware architecture to efficiently evaluate \"contain\" and \"touch\" (internal, external, corner) relations between streams of pairs of geohash codes, in a high throughput (no stall) fashion. Employing FPGAs for exploiting the bit-level granularity of geohash codes, experimental results show (end-to-end) speedup of more than 20× and 90× over highly optimized single-threaded DE-9IM implementations of the contain and touch predicates, respectively. Furthermore, the PCIe-bound FPGA-based solution outperforms a geohash-based multithreaded CPU implementation by ≈1.8× (touch predicate) while using minimal FPGA resources.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114099172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
Finding Space-Time Stream Permutations for Minimum Memory and Latency 寻找最小内存和延迟的时空流排列
Thaddeus Koehn, P. Athanas
{"title":"Finding Space-Time Stream Permutations for Minimum Memory and Latency","authors":"Thaddeus Koehn, P. Athanas","doi":"10.1109/FCCM.2016.54","DOIUrl":"https://doi.org/10.1109/FCCM.2016.54","url":null,"abstract":"Processing of parallel data streams requires permutation units for many algorithms where the streams are not independent. Such algorithms include transforms, multi-rate signal processing, and Viterbi decoding. The absolute order of data elements from the permutation is not important, only that data elements are located correctly for the next processing step. This paper describes a method to find permutations that require a minimum amount of memory and latency. The required permutations are generated based on the data dependencies of a computation set. Additional constraints are imposed so that the parallel streaming architecture processes the data without flow control. Results show agreement with brute force methods, which become computationally infeasible for large permutation sets.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"30 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115796634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Continuous Online Self-Monitoring Introspection Circuitry for Timing Repair by Incremental Partial-Reconfiguration (COSMIC TRIP) 增量部分重构定时修复连续在线自监测自省电路(COSMIC TRIP)
Hans Giesen, Benjamin Gojman, Raphael Rubin, Ji Kim, A. DeHon
{"title":"Continuous Online Self-Monitoring Introspection Circuitry for Timing Repair by Incremental Partial-Reconfiguration (COSMIC TRIP)","authors":"Hans Giesen, Benjamin Gojman, Raphael Rubin, Ji Kim, A. DeHon","doi":"10.1145/3158229","DOIUrl":"https://doi.org/10.1145/3158229","url":null,"abstract":"We show that continuously monitoring on-chip delays at the LUT-to-LUT link level during operation allows an FPGA to detect and self-adapt to aging and environmental effects on timing. Using a lightweight (<;4% added area) mechanism for monitoring transition timing, a Difference Detector with First-Fail Latch, we can estimate the timing margin on circuits and identify the individual links that have degraded and whose delay is determining the worst-case circuit delay. Combined with Choose-Your-own-Adventure precomputed, fine-grained repair alternatives, we introduce a strategy for rapid, in-system incremental repair of links with degraded timing. We show that these techniques allow us to respond to a single aging event in less than 300 ms for the toronto20 benchmarks. The result is a step toward systems where adaptive reconfiguration on the time-scale of seconds is viable and beneficial.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116072188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication 全流水线的能量效率:一个矩阵乘法的案例研究
Peipei Zhou, Hyunseok Park, Zhenman Fang, J. Cong, A. DeHon
{"title":"Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication","authors":"Peipei Zhou, Hyunseok Park, Zhenman Fang, J. Cong, A. DeHon","doi":"10.1109/FCCM.2016.50","DOIUrl":"https://doi.org/10.1109/FCCM.2016.50","url":null,"abstract":"Customized pipeline designs that minimize the pipeline initiation interval (II) maximize the throughput of FPGA accelerators designed with high-level synthesis (HLS). What is the impact of minimizing II on energy efficiency? Using a matrix-multiply accelerator, we show that matrix multiplies with II>1 can sometimes reduce dynamic energy below II=1 due to interconnect savings, but II=1 always achieves energy close to the minimum. We also identify sources of inefficient mapping in the commercial tool flow.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123185976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
Cost Effective Partial Scan for Hardware Emulation 成本有效的部分扫描硬件仿真
Tao Li, Qiang Liu
{"title":"Cost Effective Partial Scan for Hardware Emulation","authors":"Tao Li, Qiang Liu","doi":"10.1109/FCCM.2016.39","DOIUrl":"https://doi.org/10.1109/FCCM.2016.39","url":null,"abstract":"FPGA-based hardware emulation platform runs significantly faster than software simulation for verifying complex circuit designs. However, the controllability and observability of circuit internal signals mapped onto FPGAs are restricted due to the limited chip pins. Scan chain-based technique is effective in providing full-chip controllability and observability, at the cost of large area overhead, especially for FPGAs. Therefore, partial scan has been proposed as an alternative way to improve the controllability and observability while reducing the area cost. However, the optimized partial scan solution with the minimum number of scan flip-flops is not always found. This paper formulates the classical balanced structure partial scan procedure in one step as an integer linear programming problem, leading to the optimized partial scan solution. In addition, partially used logic resources in FPGAs are exploited to implement the extra logic required by the scan chain, to further reduce the area cost. Experimental results show that our partial scan approach can reduce the area overhead by 78.6% and 16.6% compared to the full scan and the existing partial scan approach.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"7 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120853874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Power-Efficient Accelerated Genomic Short Read Mapping on Heterogeneous Computing Platforms 异构计算平台上高效加速基因组短读映射
Ernst Houtgast, V. Sima, G. Marchiori, K. Bertels, Z. Al-Ars
{"title":"Power-Efficient Accelerated Genomic Short Read Mapping on Heterogeneous Computing Platforms","authors":"Ernst Houtgast, V. Sima, G. Marchiori, K. Bertels, Z. Al-Ars","doi":"10.1109/FCCM.2016.17","DOIUrl":"https://doi.org/10.1109/FCCM.2016.17","url":null,"abstract":"We propose a novel FPGA-accelerated BWA-MEM implementation, a popular tool for genomic data mapping. The performance and power-efficiency of the FPGA implementation on the single Xilinx Virtex-7 Alpha Data add-in card is compared against a software-only baseline system. By offloading the Seed Extension phase onto the FPGA, a two-fold speedup in overall application-level performance is achieved and a 1.6x gain in power-efficiency. To facilitate platform and tool-agnostic comparisons, the base pairs per Joule unit is introduced as a measure of power-efficiency. The FPGA design is able to map up to 34 thousand base pairs per Joule.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"206 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125688409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
Vertex-Centric Graph Processing on FPGA 基于FPGA的顶点中心图处理
Nina Engelhardt, Hayden Kwok-Hay So
{"title":"Vertex-Centric Graph Processing on FPGA","authors":"Nina Engelhardt, Hayden Kwok-Hay So","doi":"10.1109/FCCM.2016.31","DOIUrl":"https://doi.org/10.1109/FCCM.2016.31","url":null,"abstract":"Past research and implementation efforts have shown that FPGAs are efficient at processing many graph algorithms. However, they are notoriously hard to program, leading to impractically long development times even for simple applications. We propose a vertex-centric framework for graph processing on FPGAs, providing a base execution model and distributed architecture so that developers need only write very small application kernels.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127765494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
DeCO: A DSP Block Based FPGA Accelerator Overlay with Low Overhead Interconnect DeCO:一种基于DSP块的FPGA加速器覆盖和低开销互连
A. Jain, Xiangwei Li, P. Singhai, D. Maskell, Suhaib A. Fahmy
{"title":"DeCO: A DSP Block Based FPGA Accelerator Overlay with Low Overhead Interconnect","authors":"A. Jain, Xiangwei Li, P. Singhai, D. Maskell, Suhaib A. Fahmy","doi":"10.1109/FCCM.2016.10","DOIUrl":"https://doi.org/10.1109/FCCM.2016.10","url":null,"abstract":"Coarse-grained FPGA overlay architectures paired with general purpose processors offer a number of advantages for general purpose hardware acceleration because of software-like programmability, fast compilation, application portability, and improved design productivity. However, the area overheads of these overlays, and in particular architectures with island-style interconnect, negate many of these advantages, preventing their use in practical FPGA-based systems. Crucially, the interconnect flexibility provided by these overlay architectures is normally over-provisioned for accelerators based on feed-forward pipelined datapaths, which in many cases have the general shape of inverted cones. We propose DeCO, a cone shaped cluster of FUs utilizing a simple linear interconnect between them. This reduces the area overheads for implementing compute kernels extracted from compute-intensive applications represented as directed acyclic dataflow graphs, while still allowing high data throughput. We perform design space exploration by modeling programmability overhead as a function of overlay design parameters, and compare to the programmability overhead of island-style overlays. We observe 87% savings in LUT requirements using the proposed approach compared to DSP block based island-style overlays. Our experimental evaluation shows that the proposed overlay exhibits an achievable frequency of 395 MHz, close to the DSP theoretical limit on the Xilinx Zynq. We also present an automated tool flow that provides a rapid and vendor-independent mapping of the high level compute kernel code to the proposed overlay.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116889538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 33
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信