Latest Publications: 2018 International Conference on Field-Programmable Technology (FPT)

An Area-Efficient Out-of-Order Soft-Core Processor Without Register Renaming
2018 International Conference on Field-Programmable Technology (FPT) Pub Date: 2018-12-01 DOI: 10.1109/FPT.2018.00077
J. Kadomoto, Toru Koizumi, A. Fukuda, Reoma Matsuo, Susumu Mashimo, Akifumi Fujita, Ryota Shioya, H. Irie, S. Sakai
In this paper, we present an out-of-order soft-core processor adopting the STRAIGHT architecture. STRAIGHT has a unique instruction format in which source operands are expressed as distances from producer instructions. This eliminates the need for register renaming and removes the register map table (RMT), which usually consists of a large multi-port RAM, leading to small area, low power consumption, and high scalability of the front-end pipeline width. Moreover, the simplified architecture enables rapid miss-recovery. The prototype is implemented and evaluated on an FPGA. Compared to an out-of-order soft-core processor with a conventional RISC ISA, the proposed soft-core consumes 147-829 fewer LUTs for the front-end pipeline. The evaluation results show that the proposed soft-core operates correctly on an FPGA, with an estimated dynamic power consumption of 0.120 W.
Citations: 3
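The distance-based operand encoding described in the abstract can be illustrated with a toy interpreter. This is a simplified model for illustration, not the actual STRAIGHT ISA: every instruction produces exactly one value, and a source operand is a distance d naming the result of the instruction issued d slots earlier, so there are no architectural register names and hence nothing to rename.

```python
def run_straight(program, inputs):
    """Execute (op, dist_a, dist_b) instructions; results[i] is the value
    produced by slot i, and an operand distance d in slot i reads
    results[i - d] -- the result of the instruction issued d slots earlier."""
    results = list(inputs)                   # externally supplied live-ins
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    for i, (op, da, db) in enumerate(program, start=len(results)):
        results.append(ops[op](results[i - da], results[i - db]))
    return results[-1]

# (3 + 4) * 3: the mul reads the add's result at distance 1 and the
# live-in 3 produced three slots earlier, at distance 3.
value = run_straight([("add", 2, 1), ("mul", 1, 3)], [3, 4])
```

Because every operand is resolved purely by position, the front end needs no multi-port RMT lookup, which is the source of the LUT savings the paper reports.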
Injecting FPGA Configuration Faults in Parallel
2018 International Conference on Field-Programmable Technology (FPT) Pub Date: 2018-12-01 DOI: 10.1109/FPT.2018.00037
Shane T. Fleming, David B. Thomas
When using SRAM-based FPGA devices in safety-critical applications, testing against bit-flips in the device configuration memory is essential. Often such tests are performed by corrupting configuration memory bits of a running device, but this poses many scalability, reliability, and flexibility challenges. In this paper, we present a framework and a concrete implementation of a parallel fault injection cluster that addresses these challenges. Scalability is addressed by using multiple identical FPGA devices, each testing a different region in parallel. Reliability is addressed by using reconfigurable system-on-chip devices that are isolated from each other. Flexibility is addressed by using a pending-commit structure that continually checkpoints the overall experiment and allows elastic scaling. We test and showcase our approach by exhaustively flipping every bit in the configuration memory of the CHStone benchmark suite and a Vivado HLS-generated k-means clustering image processing application. Our results show that linear scaling is possible as the number of devices increases; that the majority of error-inducing bit-flips in the k-means application do not significantly impact the output; and that the Xilinx Essential Bits tool may miss some bits that can induce errors.
Citations: 3
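The core loop being parallelized can be sketched in a few lines (an illustrative model, not the paper's framework): flip one configuration bit, run the design against a golden result, record whether the flip induced an error, and restore the bit before the next trial. The cluster in the paper distributes exactly this sweep across many identical FPGAs, each covering a different region of the bit space.

```python
def injection_campaign(config_bits, run_design):
    """config_bits: bytearray modelling configuration memory.
    run_design: callable returning True iff the design still matches the
    golden output under the current (possibly corrupted) configuration."""
    errors = []
    for bit in range(len(config_bits) * 8):
        byte, pos = divmod(bit, 8)
        config_bits[byte] ^= 1 << pos        # inject: flip one config bit
        if not run_design(config_bits):
            errors.append(bit)               # record an error-inducing bit
        config_bits[byte] ^= 1 << pos        # repair before the next trial
    return errors

# Toy "design" that is only correct while byte 0 keeps its value 0x0F:
faults = injection_campaign(bytearray([0x0F, 0x00]),
                            lambda cfg: cfg[0] == 0x0F)
```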
Tatum: Parallel Timing Analysis for Faster Design Cycles and Improved Optimization
2018 International Conference on Field-Programmable Technology (FPT) Pub Date: 2018-12-01 DOI: 10.1109/FPT.2018.00026
Kevin E. Murray, Vaughn Betz
Static Timing Analysis (STA) is used to evaluate the correctness and performance of a digital circuit implementation. In addition to final sign-off checks, STA is called numerous times during placement and routing to guide optimization. As a result, STA consumes a significant fraction of the time required for design implementation; to make progress reducing FPGA compile times, we need faster STA. We evaluate the suitability of both GPU and multi-core CPU platforms for accelerating STA. On core STA algorithms, our GPU kernel achieves a 6.2x kernel speed-up, but data transfer overhead reduces this to 0.9x. Our best CPU implementation achieves a 9.2x parallel speed-up on 32 cores, yielding a 15.2x overall speed-up compared to the VPR analyzer, and a 6.9x larger parallel speed-up than a recent parallel ASIC timing analyzer. We then show how reducing the run-time cost of STA can be leveraged to improve optimization quality, reducing critical path delay by 4%.
Citations: 15
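The computation being parallelized is the arrival-time propagation at the heart of block-based STA. A minimal sequential sketch is below (for illustration only; a real parallel analyzer levelizes the timing graph so that all nodes in one level, which depend only on earlier levels, can be processed concurrently):

```python
from collections import defaultdict

def arrival_times(edges, primary_inputs):
    """edges: list of (src, dst, delay); returns {node: latest arrival time}."""
    fanin = defaultdict(list)
    for src, dst, delay in edges:
        fanin[dst].append((src, delay))
    arr = {pi: 0.0 for pi in primary_inputs}   # arrival at primary inputs is 0

    def at(node):
        # Latest arrival = max over fan-in of (source arrival + edge delay).
        if node not in arr:
            arr[node] = max(at(src) + d for src, d in fanin[node])
        return arr[node]

    for node in list(fanin):
        at(node)
    return arr

# Tiny graph: a,b -> x (delays 1.0, 3.0), then x -> y (delay 2.0).
times = arrival_times([("a", "x", 1.0), ("b", "x", 3.0), ("x", "y", 2.0)],
                      ["a", "b"])
```

The max-plus structure of this traversal is what makes both the GPU kernel and the multi-core CPU version possible: within a level there are no dependencies between nodes.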
Evaluating The Highly-Pipelined Intel Stratix 10 FPGA Architecture Using Open-Source Benchmarks
2018 International Conference on Field-Programmable Technology (FPT) Pub Date: 2018-12-01 DOI: 10.1109/FPT.2018.00038
Tian Tan, E. Nurvitadhi, D. Shih, Derek Chiou
Intel Stratix 10 FPGAs offer a novel architectural feature called HyperFlex that enables an extreme degree of pipelining, resulting in clock frequencies of up to 1 GHz. Prior work evaluated HyperFlex on pre-production Stratix 10 FPGAs using internal designs not accessible to the general public. This paper presents an updated evaluation of HyperFlex on the latest publicly available production Stratix 10 FPGA using open-source benchmarks. Our evaluation started with seven RTL designs from existing open-source projects, carefully chosen to capture a variety of architectures (from simple pipelines to pipelines with loops/M20Ks/DSPs) implementing well-known functions such as crypto, math, and image processing. An FPGA developer who was not an expert in HyperFlex then spent around 250 engineering hours developing 24 optimized versions of these designs, following the Intel Stratix 10 FPGA HyperFlex optimization guide. The optimized designs run at 400 MHz to 850 MHz. In this paper, we describe the optimizations, efforts, and results. Upon publication, the optimized designs will be open-sourced and published on GitHub.
Citations: 6
Synthesizable Heterogeneous FPGA Fabrics
2018 International Conference on Field-Programmable Technology (FPT) Pub Date: 2018-12-01 DOI: 10.1109/FPT.2018.00040
Brett Grady, J. Anderson
We present an automated framework for the generation of synthesizable FPGAs with heterogeneous functional blocks and carry chains, as modelled with the open-source Verilog-to-Routing (VTR) FPGA architecture evaluation framework. VTR's modelling of hardened blocks, such as DSPs and BRAMs, is leveraged to generate synthesizable FPGAs mappable via VTR's Verilog frontend. The generated Verilog source for the FPGA can be synthesized to any conventional semiconductor process via industry-standard ASIC toolflows with minimal implementation effort. We model a Stratix IV-style FPGA architecture, complete with carry chains, DSPs, and BRAMs, and compare area/performance with the commercial Stratix IV FPGA. For a set of benchmarks using the heterogeneous blocks, the area and performance gaps between the fully synthesizable and commercial fabrics are 3.2x and 2.3x, respectively. Optimizations to reduce the gap are discussed.
Citations: 15
Memory-Efficient Architecture for Accelerating Generative Networks on FPGA
2018 International Conference on Field-Programmable Technology (FPT) Pub Date: 2018-12-01 DOI: 10.1109/FPT.2018.00016
Shuanglong Liu, Chenglong Zeng, Hongxiang Fan, Ho-Cheung Ng, Jiuxi Meng, Zhiqiang Que, Xinyu Niu, W. Luk
Generative adversarial networks (GANs) are a class of artificial intelligence algorithms used in unsupervised machine learning, implemented by a system of two neural networks: a generative network (generator) and a discriminative network (discriminator). These two networks compete with each other to perform better at their respective tasks. The generator is typically a deconvolutional neural network and the discriminator is a convolutional neural network (CNN). Deconvolution performs a fundamentally different mathematical operation from convolution. While FPGA-based CNN accelerators have been widely studied in prior work, the acceleration of deconvolutional networks on FPGAs is rarely explored. This paper proposes a novel parametrized deconvolutional architecture based on an FPGA-friendly method, in contrast to the transposed-convolution implementation used on CPUs and GPUs. Hardware design templates which map this architecture to FPGAs are provided with configurable deconvolutional layer parameters. Furthermore, a memory-efficient architecture with a new tiling method is proposed to accelerate the generator of GANs, storing all intermediate data in on-chip memories and significantly reducing off-chip data transfers. The performance of the proposed accelerator is evaluated using a variety of GANs on a Xilinx Zynq 706 board, showing 2.3x higher speed and an 8.2x reduction in off-chip memory accesses compared to an optimized vanilla FPGA design. Compared to the respective implementations on CPUs and GPUs, the achieved improvements range from 30x-92x in speed over an Intel 8-core i7-950 CPU, and 8x-108x in performance-per-watt over an NVIDIA Titan X GPU.
Citations: 13
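For reference, the operation the accelerator targets is transposed convolution ("deconvolution"), sketched here as a naive 1-D scatter loop. This is the CPU/GPU-style formulation the paper's FPGA-friendly method reworks, not the paper's architecture itself:

```python
def transposed_conv1d(x, kernel, stride=2):
    """Each input element scatters a scaled copy of the kernel into the
    output, upsampling the signal -- the inverse data movement of a
    strided convolution."""
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, v in enumerate(x):
        for k, w in enumerate(kernel):
            out[i * stride + k] += v * w
    return out

# Two inputs become five outputs (stride 2, kernel width 3).
y = transposed_conv1d([1.0, 2.0], [1.0, 1.0, 1.0], stride=2)
```

The scatter-with-overlap access pattern (adjacent kernel copies accumulating into the same outputs) is exactly what makes a direct hardware mapping awkward and motivates a reformulated dataflow.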
DaCO: A High-Performance Token Dataflow Coprocessor Overlay for FPGAs
2018 International Conference on Field-Programmable Technology (FPT) Pub Date: 2018-12-01 DOI: 10.1109/FPT.2018.00032
Siddhartha, Nachiket Kapre
Dataflow computing architectures exploit dynamic parallelism at the fine granularity of individual operations and provide a pathway to overcome the performance and energy limits of conventional von Neumann models. In this vein, we present DaCO (Dataflow Coprocessor FPGA Overlay), a high-performance compute organization for FPGAs that delivers up to 2.5x speedup over existing dataflow alternatives. Historically, dataflow-style execution has been viewed as an attractive parallel computing paradigm due to the self-timed, decentralized implementation of dataflow dependencies and the absence of sequential program counters. However, realizing high-performance dataflow computers has remained elusive, largely due to the complexity of scheduling this parallelism and to data communication bottlenecks. DaCO achieves its performance by (1) supporting large-scale (1000s of nodes) out-of-order scheduling using hierarchical lookup, (2) priority-aware routing of dataflow dependencies using the efficient Hoplite-Q NoC, and (3) clustering techniques that exploit data locality in the communication network organization. Each DaCO processing element is a programmable soft processor that communicates with the others over a packet-switching network-on-chip (PSNoC). We target the Arria 10 AX115S FPGA to take advantage of its hard floating-point DSP blocks, and maximize performance by multipumping the M20K Block RAMs. Overall, we can scale DaCO to 450 processors operating at an fmax of 250 MHz on the target platform. Each soft processor consumes 779 ALMs, 4 M20K BRAMs, and 3 hard floating-point DSP blocks for optimum balance, while the on-chip communication framework consumes less than 15% of the on-chip resources.
Citations: 0
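The execution model DaCO implements in hardware can be shown with a toy token-dataflow scheduler (a conceptual sketch only; DaCO adds hierarchical priority scheduling, clustering, and a real NoC, all omitted here). There is no program counter: a node fires as soon as all of its operand tokens have arrived, so independent operations proceed out of order automatically.

```python
from operator import add, mul

def run_dataflow(nodes, inputs):
    """nodes: {name: (fn, arity, consumers)}; inputs: list of (node, value)
    tokens injected at the sources. A node fires once it has collected
    'arity' operand tokens; the worklist pop order is arbitrary, so
    execution is naturally out of order, as in a token dataflow machine."""
    inbox = {name: [] for name in nodes}
    tokens = list(inputs)                     # in-flight tokens
    results = {}
    while tokens:
        node, value = tokens.pop()
        inbox[node].append(value)
        fn, arity, consumers = nodes[node]
        if len(inbox[node]) == arity:         # dataflow firing rule
            result = fn(*inbox[node])
            results[node] = result
            tokens.extend((c, result) for c in consumers)
    return results

# (2 + 3) fed twice into a multiplier: computes (2 + 3) * (2 + 3) = 25.
graph = {"a": (add, 2, ["m", "m"]), "m": (mul, 2, [])}
out = run_dataflow(graph, [("a", 2), ("a", 3)])
```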
FPGA Architecture Enhancements for Efficient BNN Implementation
2018 International Conference on Field-Programmable Technology (FPT) Pub Date: 2018-12-01 DOI: 10.1109/FPT.2018.00039
Jin Hee Kim, Jongeun Lee, J. Anderson
Binarized neural networks (BNNs) are ultra-reduced-precision neural networks, having weights and activations restricted to single-bit values. BNN computations operate on bitwise data, making them particularly amenable to hardware implementation. In this paper, we first analyze BNN implementations on contemporary commercial 20nm FPGAs. We then propose two lightweight architectural changes that significantly improve the logic density of FPGA BNN implementations. The changes involve incorporating additional carry-chain circuitry into logic elements, where the additional circuitry is connected in a specific way to benefit BNN computations. The architectural changes are evaluated in the context of state-of-the-art Intel and Xilinx FPGAs and shown to provide over 2x area reduction in the key BNN computational task (the XNOR-popcount sub-circuit), at a modest performance cost of less than 2%.
Citations: 8
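The XNOR-popcount kernel the paper targets is small enough to state directly. With {-1,+1} values encoded one bit per element (bit 1 for +1, bit 0 for -1), a binary dot product reduces to a popcount of an XNOR, rescaled back to the signed domain:

```python
def bnn_dot(w_bits, a_bits, n):
    """Dot product of two n-element {-1,+1} vectors packed one bit per
    element (bit = 1 encodes +1, bit = 0 encodes -1)."""
    mask = (1 << n) - 1
    xnor = ~(w_bits ^ a_bits) & mask      # bit set where the signs agree
    matches = bin(xnor).count("1")        # popcount
    return 2 * matches - n                # agreements minus disagreements

# w = (+1, +1, -1, +1) -> 0b1011 and a = (+1, +1, +1, -1) -> 0b0111,
# listing elements LSB-first; signed dot product = 1 + 1 - 1 - 1 = 0.
result = bnn_dot(0b1011, 0b0111, 4)
```

The popcount (an n-input adder tree) dominates the area of this sub-circuit, which is why the paper's extra carry-chain circuitry in the logic elements pays off.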
A Nearest Neighbor Search Engine Using Distance-Based Hashing
2018 International Conference on Field-Programmable Technology (FPT) Pub Date: 2018-12-01 DOI: 10.1109/FPT.2018.00031
Toshitaka Ito, Yuri Itotani, S. Wakabayashi, Shinobu Nagayama, Masato Inagi
This paper proposes an FPGA-based nearest neighbor search engine for high-dimensional data, in which nearest neighbor search is performed using distance-based hashing. The proposed hardware search engine implements a nearest neighbor search algorithm based on an extension of flexible distance-based hashing (FDH, for short), which finds an exact solution with high probability. The proposed engine is a parallel, pipelined circuit, so that search results can be obtained in a short execution time. Experimental results show the effectiveness and efficiency of the proposed engine.
Citations: 4
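The distance-based-hashing idea can be illustrated in a few lines (a sketch of the general concept only, not the paper's extended FDH algorithm): objects are hashed by thresholding their distances to a few fixed pivot objects, so near neighbors tend to land in the same bucket, and exact distances are computed only for the colliding candidates.

```python
def dbh_key(x, pivots, dist):
    """Hash key: one bit per (pivot, threshold) pair."""
    return tuple(int(dist(x, p) <= t) for p, t in pivots)

def build_index(points, pivots, dist):
    index = {}
    for p in points:
        index.setdefault(dbh_key(p, pivots, dist), []).append(p)
    return index

def query(q, index, pivots, dist):
    """Exact distances are computed only for candidates in the query's bucket."""
    candidates = index.get(dbh_key(q, pivots, dist), [])
    return min(candidates, key=lambda p: dist(q, p), default=None)

d1 = lambda a, b: abs(a - b)                  # 1-D metric for illustration
pts = [1.0, 2.0, 9.0, 10.0]
pivots = [(0.0, 5.0)]                         # one pivot at 0, threshold 5
idx = build_index(pts, pivots, d1)            # buckets {1.0, 2.0} and {9.0, 10.0}
```

The per-candidate distance checks are independent, which is what the paper's parallel, pipelined circuit exploits.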
Software-Specified FPGA Accelerators for Elementary Functions
2018 International Conference on Field-Programmable Technology (FPT) Pub Date: 2018-12-01 DOI: 10.1109/FPT.2018.00019
J. Chen, Xue Liu, J. Anderson
We use a high-level synthesis (HLS) methodology for the design of hardware accelerators for two elementary functions: reciprocal and square root. The functions are described in C-language software and synthesized into Verilog RTL using the LegUp HLS tool from the University of Toronto [1]. The accelerators are designed to deliver high accuracy, with less than 1 ULP of error compared with GNU software (math.h). Through changes to the HLS constraints, hardware implementations with different speed/area trade-offs can be generated rapidly. In an experimental study, our HLS-generated accelerators are targeted to the Altera/Intel Cyclone V FPGA and compared with hand-designed cores from the FPGA vendor. Results show that our cores offer considerably better resource usage (i.e., ALMs, DSPs, memory bits), while the commercial cores operate at a modestly higher Fmax.
Citations: 0
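One standard way to describe reciprocal and square root in plain software, and the kind of C-like loop an HLS flow starts from, is Newton-Raphson refinement. The sketch below illustrates that style under stated assumptions (the paper does not disclose its cores' exact algorithm or seeding, and the crude seeds here are only valid for moderate inputs):

```python
def reciprocal(x, iters=30):
    """Approximate 1/x via the Newton-Raphson step y <- y * (2 - x*y)."""
    y = 0.1 if x > 1.0 else 1.0          # crude seed, valid for moderate x;
    for _ in range(iters):               # production cores seed from a LUT
        y = y * (2.0 - x * y)
    return y

def sqrt_via_rsqrt(x, iters=30):
    """Approximate sqrt(x) as x * (1/sqrt(x)), refining 1/sqrt(x) with
    the Newton step y <- y * (1.5 - 0.5*x*y*y)."""
    y = 0.1 if x > 1.0 else 1.0
    for _ in range(iters):
        y = y * (1.5 - 0.5 * x * y * y)
    return x * y
```

Each iteration roughly doubles the number of correct bits, so the iteration count is one natural knob for the speed/area/accuracy trade-offs the abstract mentions.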