{"title":"Dynamically programmable Reed Solomon processor with embedded Galois Field multiplier","authors":"A. El-Rayis, Xin Zhao, T. Arslan, A. Erdogan","doi":"10.1109/FPT.2008.4762395","DOIUrl":"https://doi.org/10.1109/FPT.2008.4762395","url":null,"abstract":"This work presents a novel reconfigurable Galois field multiplier embedded in a dynamically reconfigurable processor for real time programmable Reed Solomon (RS) encoder and decoder targeting various communication standards. The fundamental operation in Reed-Solomon encoding and decoding is the multiplication over Galois field (GF). The reconfigurable GF multiplier with single instruction multiple data (SIMD) support is presented here, as an instruction set extension to the processor. The processor supports the RS coding to be programmable for Galois Field (2^8) with its sixteen primitive polynomials and for all supported data block sizes. Various optimization techniques have been applied in order to enhance the processor throughput. The throughput achieved for RS (204,188) is up to 202 Mbps for the encoder demonstrating a future proof flexible design.","PeriodicalId":320925,"journal":{"name":"2008 International Conference on Field-Programmable Technology","volume":"1117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122933352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and implementation of a high performance financial Monte-Carlo simulation engine on an FPGA supercomputer","authors":"Xiang Tian, K. Benkrid","doi":"10.1109/FPT.2008.4762369","DOIUrl":"https://doi.org/10.1109/FPT.2008.4762369","url":null,"abstract":"Monte-Carlo simulation is a very widely used technique in scientific computations in general with huge computation benefits in solving problems where closed form solutions are impossible to derive. This technique is also characterized by a high degree of parallelism as a large number of different simulation paths need to be calculated, which makes it ideal for a parallel hardware implementation. This paper illustrates the benefits of such implementation in the context of financial computing as it implements a financial Monte-Carlo simulation engine on an FPGA-based supercomputer, called Maxwell, developed at the University of Edinburgh. The latter consists of a 32 CPU cluster augmented with 64 Virtex-4 Xilinx FPGAs connected in a 2D torus. Our engine can implement various Monte-Carlo simulations on the Maxwell machine with speed-ups in the 3-order magnitude compared to equivalent software implementations. This is illustrated in this paper in the context of an implementation of the Black-Scholes option pricing model. Real hardware implementation shows that our FPGA-based implementation of the Black-Scholes model outperforms an equivalent software implementation running on a workstation cluster with the same number of computing nodes (CPU/FPGA) by a factor of 750, which is the fastest ever reported FPGA implementation of this model.","PeriodicalId":320925,"journal":{"name":"2008 International Conference on Field-Programmable Technology","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130843397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA-specific approach to floating-point accumulation and sum-of-products","authors":"F. D. Dinechin, B. Pasca, O. Creţ, R. Tudoran","doi":"10.1109/FPT.2008.4762363","DOIUrl":"https://doi.org/10.1109/FPT.2008.4762363","url":null,"abstract":"This article studies two common situations where the flexibility of FPGAs allows one to design application-specific floating-point operators which are more efficient and more accurate than those offered by processors and GPUs. First, for applications involving the addition of a large number of floating-point values, an ad-hoc accumulator is proposed. By tailoring its parameters to the numerical requirements of the application, it can be made arbitrarily accurate, at an area cost comparable to that of a standard floating-point adder, and at a higher frequency. The second example is the sum-of-product operation, which is the building block of matrix computations. A novel architecture is proposed that feeds the previous accumulator out of a floating-point multiplier whose rounding logic has been removed, again improving the area/accuracy tradeoff. These architectures are implemented within the FloPoCo generator, freely available under the LGPL.","PeriodicalId":320925,"journal":{"name":"2008 International Conference on Field-Programmable Technology","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127789925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A run-length based connected component algorithm for FPGA implementation","authors":"Kofi Appiah, A. Hunter, P. Dickinson, Jonathan Owens","doi":"10.1109/FPT.2008.4762381","DOIUrl":"https://doi.org/10.1109/FPT.2008.4762381","url":null,"abstract":"This paper introduces a real-time connected component labelling algorithm designed for field programmable gate array (FPGA) implementation. The algorithm run-length encodes the image, and performs connected component analysis on this representation. The run-length encoding, together with other parts of the algorithm, is performed in parallel; sequential operations are minimized as the number of runs are typically less than the number of pixels. The architecture is designed mainly on Block RAM (i.e. internal RAM) of the FPGA. A comparison with the multi-pass algorithm in hardware and software is presented to show the advantages of the algorithm. The algorithm runs comfortably in real-time with reasonably low resource utilization, making integration with other real-time algorithms feasible.","PeriodicalId":320925,"journal":{"name":"2008 International Conference on Field-Programmable Technology","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131527571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An area-efficient FPGA realisation of a codebook-based image compression method","authors":"P. Zipf, H. Hinkelmann, Hui Shao, R. Dogaru, M. Glesner","doi":"10.1109/FPT.2008.4762415","DOIUrl":"https://doi.org/10.1109/FPT.2008.4762415","url":null,"abstract":"We present a hardware implementation of an efficient image compression method optimised for small FPGAs. The compression method is based on a codebook of reference patterns to support multiplication-free quantisation of the image data. Based on specific features of a low-cost FPGA architecture, a pipelined implementation is developed and evaluated. The implemented hardware benefits from the simple structure of the compression method and is optimised for area and performance. The realised hardware as well as the underlying compression mechanism are described and the synthesis results for different model variants are compared. The results show that a high compression rate is possible at extremely low hardware costs. Also, a high frame rate can be obtained even on a low-cost FPGA.","PeriodicalId":320925,"journal":{"name":"2008 International Conference on Field-Programmable Technology","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131973056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A scalable reconfiguration mechanism for fast dynamic reconfiguration","authors":"H. Hinkelmann, P. Zipf, M. Glesner","doi":"10.1109/FPT.2008.4762377","DOIUrl":"https://doi.org/10.1109/FPT.2008.4762377","url":null,"abstract":"Hardware reconfiguration during run-time provides attractive features like fast adaptivity, high hardware utilisation, and low area consumption due to efficient reuse of hardware components. In this paper, a novel multi-layered reconfiguration mechanism is proposed that allows frequent dynamic reconfiguration at very low latencies. It combines successful existing techniques such as multi-context and partial reconfiguration with new ideas like tag-matching and reconfiguration profiles to one uniform approach. As an important feature, the proposed reconfiguration mechanism is well scalable and can be adapted to given hardware structures easily, thus being applicable to virtually any reconfigurable fabric. In contrast to many existing techniques, it also supports even very heterogeneous architectures found for instance in custom reconfigurable systems. By experimental results, we show that our reconfiguration mechanism provides significantly lower reconfiguration latencies compared to some common existing techniques.","PeriodicalId":320925,"journal":{"name":"2008 International Conference on Field-Programmable Technology","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132218452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A profiler for a heterogeneous multi-core multi-FPGA system","authors":"Daniel Nunes, Manuel Saldaña, P. Chow","doi":"10.1109/FPT.2008.4762373","DOIUrl":"https://doi.org/10.1109/FPT.2008.4762373","url":null,"abstract":"Understanding the behavior of an application is rarely a trivial task, due to the complexity of the system in which the application is executed, and the complexity of the application itself. The task becomes even more troublesome, if the application is being run in a parallel environment where relationships between each application execution are needed to grasp the necessary understanding of the application behavior. FPGA flexibility increases the complexity of such tasks by allowing not only changes to the application, to adapt to the hardware, but also to tailor the hardware for a specific application. To take full advantage of these systems, a tool that will help the user to understand an application is paramount. In this paper, we present a profiler for the TMD, a heterogeneous multicore multiFPGA system designed at the University of Toronto. The profiler can be configured for a specific application running on a specific hardware configuration. It allows retrieval of all communication calls and any user state defined by instrumentation of the source code. We test the profiler with two simple case studies: MPI Barrier, where we compare a sequential with a binary tree algorithm, and a heat equation solver that uses the Jacobi iterations method, where we compare blocking with non-blocking MPI calls.","PeriodicalId":320925,"journal":{"name":"2008 International Conference on Field-Programmable Technology","volume":"29 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131470521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leakage power reduction for coarse grained dynamically reconfigurable processor arrays with fine grained Power Gating technique","authors":"Yoshiki Saito, T. Shirai, Takuro Nakamura, T. Nishimura, Y. Hasegawa, S. Tsutsumi, Toshihiro Kashima, M. Nakata, S. Takeda, K. Usami, H. Amano","doi":"10.1109/FPT.2008.4762410","DOIUrl":"https://doi.org/10.1109/FPT.2008.4762410","url":null,"abstract":"One of the benefits of coarse grained dynamically reconfigurable processor array (DRPA) is its low dynamic power consumption by operating a number of processing elements (PE) in parallel with low clock frequency. However, in the future advanced processes, leakage power will occupy a considerable part of the total power consumption, and it may degrade the advantage of DRPAs. In order to reduce the leakage power, a fine grained Power Gating (PG) is applied to a DRPA, MuCCRA-2.32b, and leakage power and area overhead are measured. We evaluated the effect of two control modes; Pair and Unit Individual based on layout design and real applications. It appears that by applying PG for ALUs and SMUs in PEs individually, 48% of leakage power can be reduced with 9.0% of area overhead.","PeriodicalId":320925,"journal":{"name":"2008 International Conference on Field-Programmable Technology","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117310158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimised single pass connected components analysis","authors":"Ni Ma, D. Bailey, C. T. Johnston","doi":"10.1109/FPT.2008.4762382","DOIUrl":"https://doi.org/10.1109/FPT.2008.4762382","url":null,"abstract":"Classical connected components labelling algorithms are unsuitable for real-time processing of streamed images on an FPGA because they require two passes through the image. Recently, a single-pass algorithm was proposed that avoided the need to buffer an intermediate image. In this paper, a new single pass algorithm is described that is a considerable improvement over the existing algorithms. The new algorithm reassigns and reuses labels each row to minimise the size of both the equivalence and region data tables. The optimised single-pass algorithm reduces the worst case memory requirement by over 100 times that of the original algorithm (for measuring region area), and reduces the latency to only 1 row.","PeriodicalId":320925,"journal":{"name":"2008 International Conference on Field-Programmable Technology","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132753293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic generation of decomposition based matrix inversion architectures","authors":"A. Irturk, Bridget Benson, A. Arfaee, R. Kastner","doi":"10.1109/FPT.2008.4762421","DOIUrl":"https://doi.org/10.1109/FPT.2008.4762421","url":null,"abstract":"Matrix inversion is an essential computation for various algorithms which are employed in multi-antenna wireless communication systems. FPGAs are ideal platforms for wireless communication; however, the need for vast amounts of customization throughout the design process of a matrix inversion core can overwhelm the designer. Decomposition methods provide the analytic simplicity and computational convenience necessary for computationally intensive matrix inversion. This paper presents automatic generation of different decomposition based matrix inversion architectures using a matrix inversion core generator tool, GUSTO with different parameterization options. We present automatic generation of a variety of general purpose matrix inversion architectures which have comparable results to published matrix inversion architecture implementations, but offer the advantage of providing the designer the ability to study the tradeoffs between architectures with different design parameters.","PeriodicalId":320925,"journal":{"name":"2008 International Conference on Field-Programmable Technology","volume":"30 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115931954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}