{"title":"Advanced Components in the Variable Precision Floating-Point Library","authors":"Xiaojun Wang, S. Braganza, M. Leeser","doi":"10.1109/FCCM.2006.21","DOIUrl":"https://doi.org/10.1109/FCCM.2006.21","url":null,"abstract":"Optimal reconfigurable hardware implementations may require the use of arbitrary floating-point formats that do not necessarily conform to IEEE specified sizes. The authors have previously presented a variable precision floating-point library for use with reconfigurable hardware. The authors recently added three advanced components: floating-point division, floating-point square root and floating-point accumulation to our library. These advanced components use algorithms that are well suited to FPGA implementations and exhibit a good tradeoff between area, latency and throughput. The floating-point format of our library is both general and flexible. All IEEE formats, including 64-bit double-precision format, are a subset of our format. All previously published floating-point formats for reconfigurable hardware are a subset of our format as well. The generic floating-point format supported by all of our library components makes it easy and convenient to create a pipelined, custom data path with optimal bitwidth for each operation. Our library can be used to achieve more parallelism and less power dissipation than adhering to a standard format. To further increase parallelism and reduce power dissipation, our library also supports hybrid fixed and floating point operations in the same design. The division and square root designs are based on table lookup and Taylor series expansion, and make use of memories and multipliers embedded on the FPGA chip. The iterative accumulator utilizes the library addition module as well as buffering and control logic to achieve performance similar to that of the addition by itself. They are all fully pipelined designs with clock speed comparable to that of other library components to aid the designer in implementing fast, complex, pipelined designs","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127748443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Sliding Window Operation Optimization for FPGA-Based","authors":"Haiqian Yu, M. Leeser","doi":"10.1109/FCCM.2006.29","DOIUrl":"https://doi.org/10.1109/FCCM.2006.29","url":null,"abstract":"FPGA-based computing boards are frequently used as hardware accelerators for image processing algorithms based on sliding window operations (SWOs). SWOs are both computationally intensive and data intensive and benefit from hardware acceleration with FPGAs, especially for delay sensitive applications. The current design process requires that, for each specific application using SWOs with different size of window, image, etc.; a detail design must be completed before a realistic estimate of the achievable speedup can be obtained. We present an automated tool, sliding window operation optimization (SWOOP), that generates the estimate of speedup for a high performance design before detailed implementation is complete. The achievable speedup is determined by the area of the FPGA, or, more often, the memory bandwidth to the processing elements. The memory bandwidth to each processing element is a combination of bandwidth to the FPGA and the efficient use of on-chip RAM as a data cache. SWOOP uses analytic techniques to automatically determine the number of parallel processing elements to implement on the FPGA, the assignment of input and output data to on-board memory, and the organization of data in on-chip memory to most effectively keep the processing elements busy. The result is a block layout of the final design, its memory architecture, and a measure of the achievable speedup. The results, compared to manual designs, show that the estimates obtained usinq SWOOP are very accurate","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"11221 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114132067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGAs, GPUs and the PS2 - A Single Programming Methodology","authors":"Lee W. Howes, P. Price, O. Mencer, Olav Beckmann","doi":"10.1109/FCCM.2006.42","DOIUrl":"https://doi.org/10.1109/FCCM.2006.42","url":null,"abstract":"Field programmable gate arrays (FPGAs), graphics processing units (GPUs) and Sony's Playstation 2 vector units offer scope for hardware acceleration of applications. Implementing algorithms on multiple architectures can be a long and complicated process. We demonstrate an approach to compiling for FPGAs, GPUs and PS2 vector units using a unified description based on A Stream Compiler (ASC) for FPGAs. As an example of its use we implement a Monte Carlo simulation using ASC. The unified description allows us to evaluate optimisations for specific architectures on top of a single base description, saving time and effort","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121975595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design of a Reconfigurable Processor for NIST Prime Field ECC","authors":"Kendall Ananyi, Daler N. Rakhmatov","doi":"10.1109/FCCM.2006.36","DOIUrl":"https://doi.org/10.1109/FCCM.2006.36","url":null,"abstract":"This paper describes a reconfigurable processor that provides support for basic elliptic curve cryptographic (ECC) operations over GF(p), such as modular addition, subtraction, multiplication, and inversion. The proposed processor can be configured for any of the five NIST primes with sizes ranging from 192 to 521 bits","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116471479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Open Source High Performance Floating-Point Modules","authors":"K. Hemmert, K. Underwood","doi":"10.1109/FCCM.2006.54","DOIUrl":"https://doi.org/10.1109/FCCM.2006.54","url":null,"abstract":"Given the logic density of modern FPGAs, it is feasible to use FPGAs for floating-point applications. However, it is important that any floating-point units that are used be highly optimized. This paper introduces an open source library of highly optimized floating-point units for Xilinx FPGAs. The units are fully IEEE compliant and acheive approximately 230 MHz operation frequency for double-precision add and multiply in a Xilinx Virtex-2-Pro FPGA (-7 speed grade). This speed is acheived with a 10 stage adder pipeline and a 12 stage multiplier pipeline. The area requirement is 571 slices for the adder and 905 slices for the multiplier","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114490199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Scalable Hybrid Regular Expression Pattern Matcher","authors":"J. Moscola, Young-Hee Cho, J. Lockwood","doi":"10.1109/FCCM.2006.18","DOIUrl":"https://doi.org/10.1109/FCCM.2006.18","url":null,"abstract":"In this paper, the authors present a reconfigurable hardware architecture for searching for regular expression patterns in streaming data. This new architecture is created by combining two popular pattern matching techniques: a pipelined character grid architecture (Baker, 2004), and a regular expression NFA architecture (Cho, 2006). The resulting hybrid architecture can scale the number of input characters while still maintaining the ability to scan for regular expression patterns","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115037823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Task Graph Approach for Efficient Exploitation of Reconfiguration in Dynamically Reconfigurable Systems","authors":"Kyprianos Papademetriou, A. Dollas","doi":"10.1109/FCCM.2006.19","DOIUrl":"https://doi.org/10.1109/FCCM.2006.19","url":null,"abstract":"Partial reconfiguration suffers from the inherent high latency and low throughput which is more considerable when reconfiguration is performed on-demand. This work deals with this overhead in processors combining a fixed processing unit (FPU), and a reconfigurable processing unit (RPU). Static and dynamic prefetching (Li, 2002), and instruction forecasting (Iliopoulos and Antonakopoulos, 2001) are targeting at reduction of the overhead through preloading of configurations. Banerjee et al. (2005) transform the task graph of an application and a heuristic algorithm evaluates the reduction in schedule length and selects the most promising configuration. Tasks are scheduled according to the physical resource constraints. In this work the prefetching model of Li (2002) was augmented by taking into account the hardware area constraints of a partially reconfigurable system. Given the task graph of an application, tasks with low probability to be executed are split and preloaded according to the hardware in order to be fully utilized. Thus, the time during which reconfiguration is overlapped with processor execution is increased","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132638599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical Clustering using Reconfigurable Devices","authors":"Shobana Padmanabhan, Moshe Looks, Dan Legorreta, Young-Hee Cho, J. Lockwood","doi":"10.1109/FCCM.2006.49","DOIUrl":"https://doi.org/10.1109/FCCM.2006.49","url":null,"abstract":"Non-hierarchical k-means algorithms have been implemented in hardware, most frequently for image clustering. Here, we focus on hierarchical clustering of text documents based on document similarity. To our knowledge, this is the first work to present a hierarchical clustering algorithm designed for hardware implementation and ours is the first hardware-accelerated implementation","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133253826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse Matrix-Vector Multiplication for Finite Element Method Matrices on FPGAs","authors":"Y. El-Kurdi, W. Gross, D. Giannacopoulos","doi":"10.1109/FCCM.2006.65","DOIUrl":"https://doi.org/10.1109/FCCM.2006.65","url":null,"abstract":"The paper presents an architecture and an implementation of an FPGA-based sparse matrix-vector multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from finite element method (FEM) applications. The architecture is based on a pipelined linear array of processing elements (PEs). A hardware-oriented matrix \"striping\" scheme is developed which reduces the number of required processing elements. The current 8 PE prototype achieves a peak performance of 1.76 GFLOPS and a sustained performance of 1.5 GFLOPS with 8 GB/s of memory bandwidth. The SMVM-pipeline uses 30% of the logic resources and 40% of the memory resources of a Stratix S80 FPGA. By virtue of the local interconnect between the PEs, the SMVM-pipeline obtain scalability features that is only limited by FPGA resources instead of the communication overhead","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121538335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance Feature Detection on a Reconfigurable Co-Processor","authors":"J. Mar, A. Bissacco, Stefano Soatto, S. Ghiasi","doi":"10.1109/FCCM.2006.50","DOIUrl":"https://doi.org/10.1109/FCCM.2006.50","url":null,"abstract":"In this paper, the authors propose a new design for feature detection used for tracking, which eliminates the need of a central computer to complete computations for the feature selection algorithm. Such a system constrains performance due to the delay in which data is transferred from camera to computer for processing. Our design suggests that feature detection computation can be done on a processor within the camera helping to reduce overall computation time for detection and increase performance for overall tracking system. However, these systems are often constrained by the processing power available to the camera. But with Benedetti and Perona's approach to Tomasi and Kanade's detection algorithm, such a design is possible to implement onto a camera system which would eliminate the delay and also improve performance over a tracking system designed on software","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"151 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121792746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}