2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM): Latest Papers, Page 2

Scheduling Considerations for Voter Checking in TMR-MER Systems

2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2017-04-01 DOI: 10.1109/FCCM.2017.17

N. T. H. Nguyen, E. Çetin, O. Diessel

Abstract: Field-Programmable Gate Arrays (FPGAs) are susceptible to radiation-induced Single Event Upsets (SEUs). A common technique for dealing with SEUs is Triple Modular Redundancy (TMR) combined with Module-based configuration memory Error Recovery (MER). By triplicating components and voting on their outputs, TMR helps localize configuration memory errors, and by reconfiguring the faulty component, MER swiftly corrects them. However, the order in which the voters of TMR components are checked has an inevitable impact on overall system reliability. In this paper, we outline an approach for computing the reliability of TMR-MER systems that consist of finitely many components. Using the derived reliability models, we demonstrate that system reliability is improved when the critical components are checked for configuration memory errors more frequently than they would be in round-robin order. We propose a genetic algorithm for finding a voter checking schedule that maximizes system reliability for systems consisting of finitely many TMR components. Simulation results indicate that the mean time to failure of TMR-MER systems can be increased by up to 100% when Variable-Rate Voter Checking (VRVC), rather than round robin, is used. We show that the power used to eliminate configuration memory errors in an exemplar TMR-MER system employing VRVC is reduced while system reliability remains high. We also demonstrate that errors can be detected 30% faster on average when the system employs VRVC instead of round robin for voter checking.

Citations: 2
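
As a rough illustration of why the voter checking order discussed in the TMR-MER abstract above can matter, the sketch below compares a round-robin schedule with a schedule that visits one component more often. It is not the paper's reliability model or its genetic algorithm. The function name `mean_detection_wait`, the one-slot-per-check timing, the uniformly random error arrival, and both example schedules are assumptions made for illustration only.

```python
from typing import Dict, List


def mean_detection_wait(schedule: List[int]) -> Dict[int, float]:
    """Expected wait (in check slots) until each component's voter is next checked.

    Assumptions (hypothetical, not taken from the paper): the schedule repeats
    cyclically, every check occupies one slot, and an error appears at a
    uniformly random instant within the cycle.
    """
    length = len(schedule)
    positions: Dict[int, List[int]] = {}
    for slot, comp in enumerate(schedule):
        positions.setdefault(comp, []).append(slot)

    waits: Dict[int, float] = {}
    for comp, slots in positions.items():
        # Cyclic gaps between consecutive checks of this component; a component
        # checked once per cycle has a single gap equal to the cycle length.
        gaps = [
            (slots[(i + 1) % len(slots)] - slots[i]) % length or length
            for i in range(len(slots))
        ]
        # An arrival falls into a gap g with probability g / length and then
        # waits g / 2 on average, which gives sum(g^2) / (2 * length).
        waits[comp] = sum(g * g for g in gaps) / (2 * length)
    return waits


round_robin = [0, 1, 2, 3]
variable_rate = [0, 1, 0, 2, 0, 3]  # hypothetical: component 0 treated as critical

if __name__ == "__main__":
    print(mean_detection_wait(round_robin))
    print(mean_detection_wait(variable_rate))
```

Under these assumptions, the expected wait for component 0 drops from 2 slots to 1 slot, while the wait for each of the other components rises from 2 slots to 3. Whether that trade improves overall reliability depends on how critical each component is, which is the question the paper's models address.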

Enabling Long Debug Traces of HLS Circuits Using Bandwidth-Limited Off-Chip Storage Devices

Pub Date: 2017-04-01 DOI: 10.1109/FCCM.2017.29

Jeffrey B. Goeders

Abstract: High-level synthesis (HLS) has gained considerable traction in recent years. Despite considerable strides in the development of quality HLS compilers, one area that is often cited as a barrier to HLS adoption is the difficulty of debugging HLS-produced circuits. Recent academic work has presented techniques that use on-chip memories to efficiently record the execution of HLS circuits and map the captured data back to the original source code, providing the user with a software-like debug experience. However, limited on-chip memory results in very short debug traces, which may force a designer to spend multiple debug iterations to resolve complicated bugs. In this work we present techniques to enable off-chip capture of HLS debug information. While off-chip storage does not suffer from the capacity limitations of on-chip memory, its usage introduces a new challenge: limited bandwidth. We show how information from within the HLS flow can be leveraged to generate a streamed debug trace within given bandwidth constraints. For a bandwidth-limited interface, we show that our techniques allow the user to observe 19x more source code variables than a basic approach.

Citations: 0

Evaluating Rapid Application Development with Python for Heterogeneous Processor-Based FPGAs

Pub Date: 2017-04-01 DOI: 10.1109/FCCM.2017.45

A. Schmidt, G. Weisz, M. French

Abstract: As modern FPGAs evolve to include more heterogeneous processing elements, such as ARM cores, it makes sense to consider these devices as processors first and FPGA accelerators second. As such, the conventional FPGA development environment must also adapt to support more software-like programming functionality. While high-level synthesis tools can help reduce FPGA development time, a large expertise gap still remains in realizing high-performance implementations. At the system level, the skill set necessary to integrate multiple custom IP hardware cores, interconnects, memory interfaces, and now heterogeneous processing elements is complex. Rather than drive FPGA development from the hardware up, we consider the impact of leveraging Python to accelerate application development. Python offers highly optimized libraries from an incredibly large developer community, yet is limited by the performance of the hardware system. In this work we evaluate the impact of using PYNQ, a Python development environment for application development on Xilinx Zynq devices, along with the performance implications and bottlenecks associated with it. We compare our results against existing C-based and hand-coded implementations to better understand whether Python can be the glue that binds together software and hardware developers.

Citations: 26

Using Runahead Execution to Hide Memory Latency in High Level Synthesis

Pub Date: 2017-04-01 DOI: 10.1109/FCCM.2017.33

Shane T. Fleming, David B. Thomas

Abstract: Reads and writes to global data in off-chip RAM can limit the performance achieved with HLS tools, as each access takes multiple cycles and usually blocks progress in the application state machine. This can be combated by using data prefetchers, which hide access time by predicting the next memory access and loading it into a cache before it is required. Unfortunately, current prefetchers are only useful for memory accesses with known regular patterns, such as walking arrays, and are ineffective for those that use irregular patterns over application-specific data structures. In this work, we demonstrate prefetchers that are tailor-made for applications, even if they have irregular memory accesses. This is achieved through program slicing, a static analysis technique that extracts the memory structure of the input code and automatically constructs an application-specific prefetcher. Both our analysis and tool are fully automated and implemented as a new compiler flag in LegUp, an open source HLS tool. We create a theoretical model showing that speedup must be between 1x and 2x. We also evaluate five benchmarks, achieving an average speedup of 1.38x with an average resource overhead of 1.15x.

Citations: 6
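
The runahead abstract above states a speedup range of 1x to 2x without giving the model behind it. One simple way a bound of that shape can arise, offered here only as an assumption and not as the authors' model, is to compare a baseline in which computation and memory access run back to back with an idealized case in which they overlap completely. The function name `overlap_speedup` and its parameters are hypothetical.

```python
def overlap_speedup(compute_cycles: float, memory_cycles: float) -> float:
    """Speedup from perfectly overlapping computation with memory access.

    Illustrative model only (an assumption, not the paper's derivation):
      baseline   = compute + memory        (memory accesses block progress)
      overlapped = max(compute, memory)    (the longer phase dominates)
    The ratio is 1 when one phase is empty and 2 when both are equal.
    """
    if compute_cycles < 0 or memory_cycles < 0:
        raise ValueError("cycle counts must be non-negative")
    if compute_cycles + memory_cycles == 0:
        raise ValueError("at least one cycle count must be positive")
    serial = compute_cycles + memory_cycles
    overlapped = max(compute_cycles, memory_cycles)
    return serial / overlapped


if __name__ == "__main__":
    for compute, memory in [(100, 0), (70, 30), (100, 100)]:
        print(compute, memory, overlap_speedup(compute, memory))
```

Because the sum of two non-negative numbers is never less than their maximum and never more than twice it, this ratio always lies in the interval from 1 to 2. The reported average of 1.38x is consistent with such a range.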

ParaDiMe: A Distributed Memory FPGA Router Based on Speculative Parallelism and Path Encoding

Pub Date: 2017-04-01 DOI: 10.1109/FCCM.2017.34

Chin Hau Hoo, Akash Kumar

Abstract: The speed and capacity of FPGAs are increasing faster than effective design tools can be developed to fully utilize them, and the routing of nets remains one of the most time-consuming stages of the FPGA design flow. While existing works have proposed methods of accelerating routing through parallelization, they are limited by the memory architecture of the system that they target. In this paper, we propose a distributed memory parallel FPGA router called ParaDiMe to address the limitations of existing works. ParaDiMe speculatively routes nets in parallel and dynamically detects the need to reduce the number of active processes in order to achieve convergence. In addition, the synchronization overhead in ParaDiMe is significantly reduced through a careful design of the messaging protocol, in which paths to sinks are encoded in a space-efficient manner. Moreover, the frequency of synchronization is tuned to ensure convergence while minimizing the communication overhead. Compared to VTR, ParaDiMe achieves an average speedup of 19.8X with 32 processes while producing similar quality of results.

Citations: 11

Bit-Width Based Resource Partitioning for CNN Acceleration on FPGA

Pub Date: 2017-04-01 DOI: 10.1109/FCCM.2017.13

Jianxin Guo, S. Yin, P. Ouyang, Leibo Liu, Shaojun Wei

Citations: 20

Efficient Particle-Grid Space Interpolation of an FPGA-Accelerated Particle-in-Cell Plasma Simulation

Pub Date: 2017-04-01 DOI: 10.1109/FCCM.2017.63

Almomany Abedalmuhdi, B. Wells, K. Nishikawa

Abstract: This paper highlights ongoing research to effectively utilize a commercially available spatially reconfigurable platform and the OpenCL framework to improve the run-time performance and reduce the overall energy consumption of an existing 2.5D Electrostatic Particle-in-Cell type plasma simulation. This problem is constrained by the finite internal FPGA resources and the performance mandate that all main OpenCL kernels for this application reside in a single FPGA image. The paper focuses on solving the particle-to-grid space interpolation phase of the simulation because of its inherent nondeterministic global memory access pattern. The implementation that is presented adheres closely to the original CPU-based model while employing local memory, task-level pipelining, and replication of kernel resources to provide a much more deterministic and coalesced access pattern. The overall simulation has been shown to have an approximately 2.5-fold improvement in performance and an eight-fold improvement in energy consumption over the life of the simulation when compared to the reference single-core CPU implementation.

Citations: 10

A Parameterizable Activation Function Generator for FPGA-Based Neural Network Applications

Pub Date: 2017-04-01 DOI: 10.1109/FCCM.2017.40

S. M. H. Ho, C.-H. Dominic Hung, Ho-Cheung Ng, Maolin Wang, Hayden Kwok-Hay So

Citations: 2

FPGA-Accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-Off

Pub Date: 2017-04-01 DOI: 10.1109/FCCM.2017.39

Kaan Kara, Dan Alistarh, G. Alonso, O. Mutlu, Ce Zhang

Citations: 66

Fast and Energy-Driven Design Space Exploration for Heterogeneous Architectures

Pub Date: 2017-04-01 DOI: 10.1109/FCCM.2017.31

Baptiste Roux, M. Gautier, O. Sentieys, J. Delahaye

Abstract: In recent years, the integration of specialized hardware accelerators in Multiprocessor System-on-Chip (MpSoC) has led to a new kind of architecture combining both software (SW) and hardware (HW) computational resources. For these new Heterogeneous MpSoC (HMpSoC) architectures, performance and energy consumption depend on a large set of parameters such as the HW/SW partitioning, the type of HW implementation, or the communication cost. Design Space Exploration (DSE) consists of adjusting these parameters while monitoring a set of metrics (execution time, power, energy efficiency) to find the best mapping of the application onto the targeted architecture. With the shift from performance-aware to energy-aware designs, computer-aided design and development tools try to reduce the large design space by simplifying HW/SW mapping mechanisms. However, energy consumption is not well supported in most DSE tools because it is difficult to estimate quickly and accurately. To this end, this work introduces a DSE method based on an analytical power model to circumvent the computation time bottleneck of state-of-the-art DSE methods. This exploration method optimizes the HW/SW partitioning and mapping under user-defined objectives, especially an energy constraint. It targets tiling-based parallel applications and relies on an analytical power model that provides the DSE framework with the execution time and energy of a HW/SW configuration. The power model parameters are obtained from measurements of a tiny subset of the design space, which are then injected into two extraction functions to obtain analytical formulations of the execution time and the energy consumption of the computation kernel. The partitioning problem constraints are defined as a set of inequalities with Boolean, integer (discrete), and non-integer (continuous) variables within a Mixed Integer Linear Programming (MILP) framework. The configuration that minimizes the user objective (e.g. execution time or total energy consumption) can then be determined efficiently, within a second, using commercial or open source solvers. This methodology was tested on a Zynq-based heterogeneous architecture with two application kernels: a matrix multiplication and a stencil computation. The results show a speed-up and an energy saving of at least 12% compared to standard approaches. They also show that the most energy-efficient solution is application- and platform-dependent and, moreover, hard to predict. Such a method could be included in a complete framework with a multi-step exploration to obtain an energy-efficient mapping of a full application on HMpSoC and to open new opportunities for future computer-aided design tools.

Citations: 0
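
To make the HW/SW partitioning idea in the design space exploration abstract above more concrete, the sketch below picks how many tiles of a tiled kernel to run on a hardware accelerator so that energy is minimized under a deadline. It is a toy stand-in, not the paper's method: it enumerates every split exhaustively instead of solving a MILP, and the linear time and energy model, the parameter names, and all numeric values are hypothetical.

```python
from typing import NamedTuple, Optional


class Mapping(NamedTuple):
    hw_tiles: int    # tiles assigned to the hardware accelerator
    time_s: float    # execution time of this split
    energy_j: float  # total energy of this split


def best_mapping(
    n_tiles: int,
    t_sw: float,      # seconds per tile in software (hypothetical)
    t_hw: float,      # seconds per tile in hardware (hypothetical)
    e_sw: float,      # dynamic joules per tile in software (hypothetical)
    e_hw: float,      # dynamic joules per tile in hardware (hypothetical)
    p_static: float,  # static power in watts, paid for the whole run
    deadline_s: float,
) -> Optional[Mapping]:
    """Return the minimum-energy split that meets the deadline, or None.

    Assumes hardware and software process their tiles concurrently, so the
    run time is set by whichever side finishes last.
    """
    best: Optional[Mapping] = None
    for hw_tiles in range(n_tiles + 1):
        sw_tiles = n_tiles - hw_tiles
        time_s = max(hw_tiles * t_hw, sw_tiles * t_sw)
        if time_s > deadline_s:
            continue  # violates the timing constraint
        energy_j = hw_tiles * e_hw + sw_tiles * e_sw + p_static * time_s
        if best is None or energy_j < best.energy_j:
            best = Mapping(hw_tiles, time_s, energy_j)
    return best


if __name__ == "__main__":
    print(best_mapping(10, t_sw=4.0, t_hw=1.0, e_sw=2.0, e_hw=3.0,
                       p_static=1.0, deadline_s=12.0))
```

With these made-up numbers the cheapest feasible split is neither all hardware nor all software, which echoes the abstract's remark that the most energy-efficient solution is hard to predict. Exhaustive search is workable only because this toy has a single integer decision variable; larger design spaces are where a MILP solver, as used in the paper, would be needed.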