{"title":"A hardware complete detection mechanism for an energy efficient reconfigurable accelerator CMA","authors":"Akihito Tsusaka, Mai Izawa, Rie Uno, Nobuyuki Ozaki, H. Amano","doi":"10.1109/FPL.2013.6645594","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645594","url":null,"abstract":"Cool Mega Array (CMA) is an energy efficient Coarse Grained Reconfigurable processor Array (CGRA) consisting of a large PE (Processing Element) array. In order to reduce the power consumed by storing intermediate results and by the clock tree, the PE array consists of combinatorial circuits. A hardware completion detection mechanism for CMA is proposed, implemented and evaluated. Each PE uses serially connected buffers with selectable taps, and the delay is decided according to the operation executed in the PE. Since the completion signal is transferred on exactly the same paths as those used for computation, the delay in the switch and wires is accounted for. The post layout simulation revealed that the same performance as without the mechanism can be obtained with only 5.1% area overhead and less than 6% extra power consumption. With the mechanism, a single micro-code can be used for various supply voltages to the PE array.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133639897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Degradation in FPGAs: Monitoring, modeling and mitigation (PHD forum paper: Thesis broad overview)","authors":"A. Amouri, M. Tahoori","doi":"10.1109/FPL.2013.6645614","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645614","url":null,"abstract":"The continuous shrinking of CMOS transistors in the nano-scale era poses many manufacturing and reliability challenges such as process variation, sub-threshold leakage, power dissipation, increased circuit noise sensitivity, and reliability concerns due to transient (e.g. radiation-induced soft errors) and permanent (e.g. transistor aging) failures [1, 2]. State-of-the-art FPGAs, pushed by the ever-increasing demands on higher performance and lower power, use the latest advancements in CMOS technology [3, 4], and thus they share most of these challenges. Therefore, to guarantee the required lifetime of FPGA-mapped systems in the field, proper techniques at various levels should be devised. Transistor aging, as an important factor, causes an increase in the magnitude of threshold voltage, which in turn slows down the switching speed of the transistor and leads to timing failures and faster wear-out rates [5]. Properly dealing with this issue in FPGAs requires modeling, monitoring and mitigation at the device and architecture levels as well as in the tool-chain at the user level.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130268747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image recognition operation on a dynamically reconfigurable vision architecture","authors":"Yuki Kamikubo, Minoru Watanabe, S. Kawahito","doi":"10.1109/FPL.2013.6645603","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645603","url":null,"abstract":"Recently, for use in autonomous vehicles and robots, demand has been increasing for high-speed image recognition that is superior to that of the human eye. However, to recognize numerous images quickly, such systems require many template images to be read out dynamically from memory. They must then be sent to a processor quickly. Achieving such high-speed real-time image recognition operation is difficult because of the bottleneck of the transfer between the memory and the processor. To alleviate that bottleneck, a dynamically reconfigurable vision architecture was proposed. This paper presents 16-gray scale image recognition operation of the proposed architecture.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130416220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient FPGA overlay for portable custom instruction set extensions","authors":"Dirk Koch, Christian Beckhoff, G. Lemieux","doi":"10.1109/FPL.2013.6645517","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645517","url":null,"abstract":"Custom instruction set extensions can substantially boost performance of reconfigurable softcore CPUs. While this approach is commonly tailored to one specific FPGA system, we are presenting a fine-grained FPGA-like overlay architecture which can be implemented in the user logic of various FPGA families from different vendors. This allows the execution of a portable application consisting of a program binary and an overlay configuration in a completely heterogeneous environment. Furthermore, we are presenting different optimizations for dramatically reducing the implementation cost of the proposed overlay architecture. In particular, this includes the mapping of the overlay interconnection network directly into the switch fabric of the hosting FPGA. Our case study demonstrates an overhead reduction of an order of magnitude as compared to related approaches.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130450544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA-accelerated sliding window classifier with structured features","authors":"Ondrej Sychrovsky, Martin Matousek, R. Sára","doi":"10.1109/FPL.2013.6645560","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645560","url":null,"abstract":"Certain classification tasks in computer vision require the classifier response to be computed in every pixel of an image. When combined with large, complex features, it becomes challenging to build such a classifier on a standard PC architecture and achieve real-time performance. We present an FPGA implementation of a car wheel classifier response computation, built as an instantiation of a generic classification system. An interesting optimization problem concerning performance and speed is addressed. Our implementation is running in real-time as a part of a more complex collision mitigation system based on car detection in video data.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"266 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134255356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"In pursuit of instant gratification for FPGA design","authors":"A. Love, Wenwei Zha, P. Athanas","doi":"10.1109/FPL.2013.6645505","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645505","url":null,"abstract":"This paper describes an alternative FPGA design compilation flow to reduce the back-end time required to implement a Xilinx FPGA design. Using a library of precompiled modules and associated meta-data, bitstream-level assembly of desired designs can occur in a fraction of the time of traditional back-end tools. Modules are bound, placed, and routed using custom bitstream assembly with the primary objective of rapid compilation while preserving performance. Since vendor tools are not needed for assembly, compilation can be performed in embedded and/or untethered environments. As a result, large device compilations can be assembled in seconds. This turbo flow (TFlow) enables software-like turn-around time for faster prototyping and increased productivity.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130065769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Runtime assertions and exceptions for streaming systems","authors":"T. Todman, W. Luk","doi":"10.1109/FPL.2013.6645597","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645597","url":null,"abstract":"We present an approach to enable run-time, in-circuit assertions and exceptions in reconfigurable hardware designs. Static, compile-time checking, including formal verification, can catch many errors before a reconfigurable design is implemented. However, many other errors cannot be caught by static approaches, including those due to run-time data. Our approach allows users to add run-time assertions and exceptions to a design, giving multiple ways to handle run-time errors. Our work includes an abstract approach to adding assertions and exceptions to a design, a concrete implementation for Maxeler streaming designs, and an evaluation. Results show low overhead for adding exceptions to a design.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132239411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TputCache: High-frequency, multi-way cache for high-throughput FPGA applications","authors":"Aaron Severance, G. Lemieux","doi":"10.1109/FPL.2013.6645537","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645537","url":null,"abstract":"Throughput processing involves using many different contexts or threads to solve multiple problems or subproblems in parallel, where the size of the problem is large enough that latency can be tolerated. Bandwidth is required to support multiple concurrent executions, however, and utilizing multiple external memory channels is costly. For small working sets, FPGA designers can use on-chip BRAMs to achieve the necessary bandwidth without increasing the system cost. Designing algorithms around fixed-size local memories is difficult, however, as there is no graceful fallback if the problem size exceeds the amount of local memory. This paper introduces TputCache, a cache designed to meet the needs of throughput processing on FPGAs, giving the throughput performance of on-chip BRAMs when the problem size fits in local memory. The design utilizes a replay-based architecture to achieve high frequency with very low resource overheads.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"7 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122193228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing and combining GPU and FPGA accelerators in an image processing context","authors":"B. Silva, An Braeken, E. D'Hollander, A. Touhafi, Jan G. Cornelis, J. Lemeire","doi":"10.1109/FPL.2013.6645552","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645552","url":null,"abstract":"Nowadays, processors alone cannot deliver what computation-hungry image processing applications demand. An alternative is to use hardware accelerators such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs). Applications, however, exhibit different performance characteristics depending on the accelerator. This paper describes the hybrid platform and the programming environment that allow programs to be created efficiently on a combined GPU/FPGA desktop. We use the roofline model to identify the most appropriate accelerator for each application and High-Level Synthesis (HLS) tools to reduce the FPGA development time. To introduce our platform and tool chain, both accelerators are compared by implementing a basic image operation. Next, a promising algorithm is explored and implemented, splitting and distributing the work between GPU, FPGA and CPU in order to validate the hybrid concept. Our results show that their combination exhibits a higher performance for computationally intensive image processing applications than a GPU only.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125243992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel and scalable custom computing for real-time fluid simulation on a cluster node with four tightly-coupled FPGAs","authors":"K. Sano, R. Ito, Hayato Suzuki, Yoshiaki Kono","doi":"10.1109/FPL.2013.6645625","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645625","url":null,"abstract":"Summary form only given. Numerical simulation based on computational fluid dynamics (CFD) is now an indispensable technique especially in industry due to its acquisition capability of various data at a lower cost than experiments using a wind tunnel. The lattice Boltzmann method (LBM) is one of the CFD schemes, which is used to compute various problems including multiphase flow. LBM has good parallelism, but simultaneously requires many data to compute each lattice point, resulting in a low operational intensity. Consequently, the sustained performance of LBM is limited by memory bandwidth rather than arithmetic performance when computed by using general-purpose processors and GPUs. To make matters worse, insufficient bandwidth and high-latency of an interconnection network cause a relatively big overhead in parallel computing, especially in the case of strong-scaling.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127386809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}