2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM): Latest Papers

Scheduling Considerations for Voter Checking in TMR-MER Systems
N. T. H. Nguyen, E. Çetin, O. Diessel
{"title":"Scheduling Considerations for Voter Checking in TMR-MER Systems","authors":"N. T. H. Nguyen, E. Çetin, O. Diessel","doi":"10.1109/FCCM.2017.17","DOIUrl":"https://doi.org/10.1109/FCCM.2017.17","url":null,"abstract":"Field-Programmable Gate Arrays (FPGAs) are susceptible to radiation-induced Single Event Upsets (SEUs). A common technique for dealing with SEUs is Triple Modular Redundancy (TMR) combined with Module-based configuration memory Error Recovery (MER). By triplicating components and voting on their outputs, TMR helps localize the configuration memory errors, and by reconfiguring the faulty component, MER swiftly corrects the errors. However, the order in which the voters of TMR components are checked has an inevitable impact on the overall system reliability. In this paper, we outline an approach for computing the reliability of TMR-MER systems that consist of finitely many components. Using the derived reliability models, we demonstrate that the system reliability is improved when the critical components are checked more frequently for the presence of configuration memory errors than when they are checked in round-robin order. We propose a genetic algorithm for finding a voter checking schedule that maximizes system reliability for systems consisting of finitely many TMR components. Simulation results indicate that the mean time to failure of TMR-MER systems can be increased by up to 100% when Variable-Rate Voter Checking (VRVC), rather than round robin, is used. We show that the power used to eliminate configuration memory errors in an exemplar TMR-MER system employing VRVC is reduced while system reliability remains high. 
We also demonstrate that errors can be detected 30% faster on average when the system employs VRVC instead of round robin for voter checking.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126940159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
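The voting step that the abstract above relies on (triplicate a component, vote on the outputs, and use a disagreement to localize the faulty copy) can be illustrated with a minimal sketch. This is a generic bitwise 2-of-3 majority voter, not the authors' implementation; the function names `majority_vote` and `faulty_replica` are hypothetical and chosen only for this illustration.

```python
from typing import Optional


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority over three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def faulty_replica(a: int, b: int, c: int) -> Optional[int]:
    """Return the index (0, 1 or 2) of the single replica that disagrees
    with the voted output, or None if there is no unique disagreeing replica."""
    voted = majority_vote(a, b, c)
    mismatches = [i for i, out in enumerate((a, b, c)) if out != voted]
    return mismatches[0] if len(mismatches) == 1 else None


if __name__ == "__main__":
    # Replica 2 has two flipped bits; the voter masks the error.
    print(bin(majority_vote(0b1010, 0b1010, 0b0110)))
    print(faulty_replica(0b1010, 0b1010, 0b0110))
```

In a TMR-MER setting, the index returned by a check of this kind would identify which module to reconfigure; how often each voter is checked is the scheduling question the paper studies.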
Enabling Long Debug Traces of HLS Circuits Using Bandwidth-Limited Off-Chip Storage Devices
Jeffrey B. Goeders
{"title":"Enabling Long Debug Traces of HLS Circuits Using Bandwidth-Limited Off-Chip Storage Devices","authors":"Jeffrey B. Goeders","doi":"10.1109/FCCM.2017.29","DOIUrl":"https://doi.org/10.1109/FCCM.2017.29","url":null,"abstract":"High-level synthesis (HLS) has gained considerable traction in recent years. Despite considerable strides in the development of quality HLS compilers, one area that is often cited as a barrier to HLS adoption is the difficulty in debugging HLS-produced circuits. Recent academic work has presented techniques that use on-chip memories to efficiently record execution of HLS circuits, and map the captured data back to the original source code to provide the user with a software-like debug experience. However, limited on-chip memory results in very short debug traces, which may force a designer to spend multiple debug iterations to resolve complicated bugs. In this work we present techniques to enable off-chip capture of HLS debug information. While off-chip storage does not suffer from the capacity limitations of on-chip memory, its usage introduces a new challenge: limited bandwidth. In this work we show how information from within the HLS flow can be leveraged to generate a streamed debug trace within given bandwidth constraints. 
For a bandwidth limited interface, we show that our techniques allow the user to observe 19x more source code variables than using a basic approach.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131897741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Evaluating Rapid Application Development with Python for Heterogeneous Processor-Based FPGAs
A. Schmidt, G. Weisz, M. French
{"title":"Evaluating Rapid Application Development with Python for Heterogeneous Processor-Based FPGAs","authors":"A. Schmidt, G. Weisz, M. French","doi":"10.1109/FCCM.2017.45","DOIUrl":"https://doi.org/10.1109/FCCM.2017.45","url":null,"abstract":"As modern FPGAs evolve to include more heterogeneous processing elements, such as ARM cores, it makes sense to consider these devices as processors first and FPGA accelerators second. As such, the conventional FPGA development environment must also adapt to support more software-like programming functionality. While high-level synthesis tools can help reduce FPGA development time, there still remains a large expertise gap in order to realize highly performing implementations. At a system-level the skill set necessary to integrate multiple custom IP hardware cores, interconnects, memory interfaces, and now heterogeneous processing elements is complex. Rather than drive FPGA development from the hardware up, we consider the impact of leveraging Python to accelerate application development. Python offers highly optimized libraries from an incredibly large developer community, yet is limited to the performance of the hardware system. In this work we evaluate the impact of using PYNQ, a Python development environment for application development on the Xilinx Zynq devices, the performance implications, and bottlenecks associated with it. 
We compare our results against existing C-based and hand-coded implementations to better understand if Python can be the glue that binds together software and hardware developers.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134585538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Using Runahead Execution to Hide Memory Latency in High Level Synthesis
Shane T. Fleming, David B. Thomas
{"title":"Using Runahead Execution to Hide Memory Latency in High Level Synthesis","authors":"Shane T. Fleming, David B. Thomas","doi":"10.1109/FCCM.2017.33","DOIUrl":"https://doi.org/10.1109/FCCM.2017.33","url":null,"abstract":"Reads and writes to global data in off-chip RAM can limit the performance achieved with HLS tools, as each access takes multiple cycles and usually blocks progress in the application state machine. This can be combated by using data prefetchers, which hide access time by predicting the next memory access and loading it into a cache before it's required. Unfortunately, current prefetchers are only useful for memory accesses with known regular patterns, such as walking arrays, and are ineffective for those that use irregular patterns over application-specific data structures. In this work, we demonstrate prefetchers that are tailor-made for applications, even if they have irregular memory accesses. This is achieved through program slicing, a static analysis technique that extracts the memory structure of the input code and automatically constructs an application-specific prefetcher. Both our analysis and tool are fully automated and implemented as a new compiler flag in LegUp, an open source HLS tool. 
In this work we create a theoretical model showing that speedup must be between 1x and 2x; we also evaluate five benchmarks, achieving an average speedup of 1.38x with an average resource overhead of 1.15x.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"103 32","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113945472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
ParaDiMe: A Distributed Memory FPGA Router Based on Speculative Parallelism and Path Encoding
Chin Hau Hoo, Akash Kumar
{"title":"ParaDiMe: A Distributed Memory FPGA Router Based on Speculative Parallelism and Path Encoding","authors":"Chin Hau Hoo, Akash Kumar","doi":"10.1109/FCCM.2017.34","DOIUrl":"https://doi.org/10.1109/FCCM.2017.34","url":null,"abstract":"The increase in speed and capacity of FPGAs is faster than the development of effective design tools to fully utilize it, and routing of nets remains one of the most time-consuming stages of the FPGA design flow. While existing works have proposed methods of accelerating routing through parallelization, they are limited by the memory architecture of the system that they target. In this paper, we propose a distributed memory parallel FPGA router called ParaDiMe to address the limitations of existing works. ParaDiMe speculatively routes nets in parallel and dynamically detects the need to reduce the number of active processes in order to achieve convergence. In addition, the synchronization overhead in ParaDiMe is significantly reduced through a careful design of the messaging protocol where paths to sinks are encoded in a space-efficient manner. Moreover, the frequency of synchronization is tuned to ensure convergence while minimizing the communication overhead. 
Compared to VTR, ParaDiMe achieves an average speedup of 19.8X with 32 processes while producing similar quality of results.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116319895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Bit-Width Based Resource Partitioning for CNN Acceleration on FPGA
Jianxin Guo, S. Yin, P. Ouyang, Leibo Liu, Shaojun Wei
{"title":"Bit-Width Based Resource Partitioning for CNN Acceleration on FPGA","authors":"Jianxin Guo, S. Yin, P. Ouyang, Leibo Liu, Shaojun Wei","doi":"10.1109/FCCM.2017.13","DOIUrl":"https://doi.org/10.1109/FCCM.2017.13","url":null,"abstract":"Convolutional neural networks (CNNs) have achieved great success in many applications. Recently, various FPGA-based accelerators have been proposed to improve the performance of CNNs. However, most current FPGA-based methods use a single bit-width selection for all CNN layers, which leads to very low resource utilization efficiency and difficulty in further performance improvement. In this paper, we propose a new approach utilizing bit-width partitioning of FPGA DSP resources to improve the performance and resource utilization efficiency of CNN accelerators. Moreover, we use an optimization approach to find the optimal allocation plan for DSP resources. On a Xilinx Virtex-7 FPGA, our design approach achieves a performance improvement over state-of-the-art FPGA-based CNN accelerators of 5.48x to 7.25x, and 6.21x on average, when evaluated on popular CNNs.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115683514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Efficient Particle-Grid Space Interpolation of an FPGA-Accelerated Particle-in-Cell Plasma Simulation
Almomany Abedalmuhdi, B. Wells, K. Nishikawa
{"title":"Efficient Particle-Grid Space Interpolation of an FPGA-Accelerated Particle-in-Cell Plasma Simulation","authors":"Almomany Abedalmuhdi, B. Wells, K. Nishikawa","doi":"10.1109/FCCM.2017.63","DOIUrl":"https://doi.org/10.1109/FCCM.2017.63","url":null,"abstract":"This paper highlights on-going research to effectively utilize a commercially available spatially reconfigurable platform and the OpenCL framework to improve the run-time performance and reduce the overall energy consumption of an existing 2.5D Electrostatic Particle-in-Cell type plasma simulation. This problem is constrained by the finite internal FPGA resources and the performance mandate that all main OpenCL kernels for this application reside in a single FPGA image. The paper focuses on solving the particle-to-grid space interpolation phase of the simulation because of its inherent nondeterministic global memory access pattern. The implementation that is presented adheres closely to the original CPU-based model while employing local memory, task level pipelining, and replication of kernel resources to provide a much more deterministic and coalesced access pattern. 
The overall simulation has been shown to have an approximately 2.5-fold improvement in performance and an eight-fold improvement in energy consumption over the life of the simulation when compared to the reference single core CPU implementation.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123644371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
A Parameterizable Activation Function Generator for FPGA-Based Neural Network Applications
S. M. H. Ho, C.-H. Dominic Hung, Ho-Cheung Ng, Maolin Wang, Hayden Kwok-Hay So
{"title":"A Parameterizable Activation Function Generator for FPGA-Based Neural Network Applications","authors":"S. M. H. Ho, C.-H. Dominic Hung, Ho-Cheung Ng, Maolin Wang, Hayden Kwok-Hay So","doi":"10.1109/FCCM.2017.40","DOIUrl":"https://doi.org/10.1109/FCCM.2017.40","url":null,"abstract":"Neural network applications on FPGAs at times require arithmetic operators that are either not available in the manufacturer's core library, or are complex operators made up of several elementary functions, requiring more resources than if they were built as single operators. In this work, we built an open-source, parameterized floating-point core generator named NnCore, for operators used as activation functions, and their derivatives. We propose a binary search algorithm to search for minimax-polynomial segments, with adjusting steps for ensuring monotonicity between different segments.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122650710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
FPGA-Accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-Off
Kaan Kara, Dan Alistarh, G. Alonso, O. Mutlu, Ce Zhang
{"title":"FPGA-Accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-Off","authors":"Kaan Kara, Dan Alistarh, G. Alonso, O. Mutlu, Ce Zhang","doi":"10.1109/FCCM.2017.39","DOIUrl":"https://doi.org/10.1109/FCCM.2017.39","url":null,"abstract":"Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides similar performance as a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme—called stochastic quantization, specifically designed for machine learning applications. We test both full-precision and low-precision designs on various regression and classification data sets. We achieve up to an order of magnitude training speedup when using low-precision data compared to a full-precision SGD on the same FPGA and a state-of-the-art multi-core solution, while maintaining the quality of training. We open source the designs presented in this paper.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"327 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122739371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 66
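The abstract above names stochastic quantization without describing it. As a rough illustration only, the sketch below shows generic unbiased stochastic rounding onto a uniform grid: a value is rounded up or down at random so that its expected quantized value equals the original. This is an assumption about the general idea, not the authors' scheme; the function name `stochastic_quantize` and its parameters (`levels`, `lo`, `hi`, `rng`) are hypothetical.

```python
import random


def stochastic_quantize(x, levels, lo=0.0, hi=1.0, rng=random):
    """Round x (assumed to lie in [lo, hi]) onto a uniform grid of `levels`
    points, choosing the upper neighbour with probability equal to the
    fractional position, so the result is unbiased in expectation."""
    step = (hi - lo) / (levels - 1)
    pos = (x - lo) / step          # position measured in grid steps
    lower = int(pos)               # index of the grid point at or below x
    frac = pos - lower             # distance to that grid point, in [0, 1)
    idx = lower + (1 if rng.random() < frac else 0)
    return lo + idx * step


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_quantize(0.3, 5, rng=rng) for _ in range(20000)]
    # Each sample is 0.25 or 0.5; the sample mean should be close to 0.3.
    print(sorted(set(samples)), sum(samples) / len(samples))
```

With 5 levels on [0, 1], each value needs only 3 bits instead of a 32-bit float, which is the kind of bandwidth saving a low-precision design can exploit.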
Fast and Energy-Driven Design Space Exploration for Heterogeneous Architectures
Baptiste Roux, M. Gautier, O. Sentieys, J. Delahaye
{"title":"Fast and Energy-Driven Design Space Exploration for Heterogeneous Architectures","authors":"Baptiste Roux, M. Gautier, O. Sentieys, J. Delahaye","doi":"10.1109/FCCM.2017.31","DOIUrl":"https://doi.org/10.1109/FCCM.2017.31","url":null,"abstract":"In recent years, the integration of specialized hardware accelerators in Multiprocessor System-on-Chip (MpSoC) has led to a new kind of architecture combining both software (SW) and hardware (HW) computational resources. For these new Heterogeneous MpSoC (HMpSoC) architectures, performance and energy consumption depend on a large set of parameters such as the HW/SW partitioning, the type of HW implementation or the communication cost. Design Space Exploration (DSE) consists of adjusting these parameters while monitoring a set of metrics (execution time, power, energy efficiency) to find the best mapping of the application on the targeted architecture. With the shift from performance-aware to energy-aware designs, computer-aided design and development tools try to reduce the large design space by simplifying HW/SW mapping mechanisms. However, energy consumption is not well supported in most DSE tools because it is difficult to estimate energy consumption quickly and accurately. To this end, this work introduces a DSE method based on an analytical power model to circumvent the computation time bottleneck of state-of-the-art DSE methods. This exploration method optimizes the HW/SW partitioning and mapping under user-defined objectives, especially an energy constraint. It targets tiling-based parallel applications and relies on an analytical power model that provides the DSE framework with the execution time and energy of a HW/SW configuration. The power model parameters are obtained from measurements of a tiny subset of the design space, which are then injected into two extraction functions to obtain analytical formulations of the execution time and the energy consumption of the computation kernel. 
The partitioning problem constraints are defined as a set of inequalities with Boolean, integer (discrete) and non-integer (continuous) variables within a Mixed Integer Linear Programming (MILP) framework. Then, the best configuration that minimizes the user objective (e.g. execution time or total energy consumption) can be efficiently determined using commercial or open source solvers within a second. This methodology was tested on a Zynq-based heterogeneous architecture with two application kernels: a matrix multiplication and a stencil computation. The results show at least a 12% speed-up and energy saving compared to standard approaches. They also show that the most energy-efficient solution is application- and platform-dependent and, moreover, hard to predict. Such a method could be included in a complete framework with a multi-step exploration to obtain an energy-efficient mapping of a full application on HMpSoC and to open new opportunities for future computer-aided design tools.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122056726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0