R. Porter, J. Frigo, M. Gokhale, C. Wolinski, François Charot, Charles Wagner
{"title":"A Programmable, Maximal Throughput Architecture for Neighborhood Image Processing","authors":"R. Porter, J. Frigo, M. Gokhale, C. Wolinski, François Charot, Charles Wagner","doi":"10.1109/FCCM.2006.13","DOIUrl":"https://doi.org/10.1109/FCCM.2006.13","url":null,"abstract":"The authors propose a run-time re-configurable architecture for local neighborhood image processing. Discussion of how the new architecture can offer improved flexibility to the developer. The authors show that for a satellite image feature extraction application, our architecture, implemented on Stratix II and Virtex 2 field programmable gate arrays, achieves similar performance, hardware resource utilization, and throughput as fully pipelined systolic array architecture","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125214075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rapid Design of Special-Purpose Pipeline Processors with FPGAs and its Application to Computational Fluid Dynamics","authors":"G. Lienhart, G. M. Martinez, A. Kugel, R. Männer","doi":"10.1109/FCCM.2006.60","DOIUrl":"https://doi.org/10.1109/FCCM.2006.60","url":null,"abstract":"This paper presents a framework for rapid development of FPGA based custom processors based on floating-point calculation units. The framework consists of a fully parameterized floating-point library, an easy-to-use pipeline generator and an interface generator for memory and I/O-modules. The performance of this approach is shown for the implementation of an SPH-algorithm.","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123976389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nachiket Kapre, Nikil Mehta, Michael DeLorimier, Raphael Rubin, Henry Barnor, M. J. Wilson, M. Wrighton, A. DeHon
{"title":"Packet Switched vs. Time Multiplexed FPGA Overlay Networks","authors":"Nachiket Kapre, Nikil Mehta, Michael DeLorimier, Raphael Rubin, Henry Barnor, M. J. Wilson, M. Wrighton, A. DeHon","doi":"10.1109/FCCM.2006.55","DOIUrl":"https://doi.org/10.1109/FCCM.2006.55","url":null,"abstract":"Dedicated, spatially configured FPGA interconnect is efficient for applications that require high throughput connections between processing elements (PEs) but with a limited degree of PE interconnectivity (e.g. wiring up gates and datapaths). Applications which virtualize PEs may require a large number of distinct PE-to-PE connections (e.g. using one PE to simulate 100s of operators, each requiring input data from thousands of other operators), but with each connection having low throughput compared with the PE's operating cycle time. In these highly interconnected conditions, dedicating spatial interconnect resources for all possible connections is costly and inefficient. Alternatively, we can time share physical network resources by virtualizing interconnect links, either by statically scheduling the sharing of resources prior to runtime or by dynamically negotiating resources at runtime. We explore the tradeoffs (e.g. area, route latency, route quality) between time-multiplexed and packet-switched networks overlayed on top of commodity FPGAs. We demonstrate modular and scalable networks which operate on a Xilinx XC2V6000-4 at 166MHz. For our applications, time-multiplexed, offline scheduling offers up to a 63% performance increase over online, packet-switched scheduling for equivalent topologies. When applying designs to equivalent area, packet-switching is up to 2times faster for small area designs while time-multiplexing is up to 5times faster for larger area designs. When limited to the capacity of a XC2V6000, if all communication is known, time-multiplexed routing outperforms packet-switching; however when the active set of links drops below 40% of the potential links, packet-switched routing can outperform time-multiplexing","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127729400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tatsuhiro Tachibana, Y. Murata, N. Shibata, K. Yasumoto, Minoru Ito
{"title":"General Architecture for Hardware Implementation of Genetic Algorithm","authors":"Tatsuhiro Tachibana, Y. Murata, N. Shibata, K. Yasumoto, Minoru Ito","doi":"10.1109/FCCM.2006.43","DOIUrl":"https://doi.org/10.1109/FCCM.2006.43","url":null,"abstract":"In this paper, the authors propose a technique to flexibly implement genetic algorithms (GAs) for various problems on FPGAs. For the purpose, the authors propose a common architecture for GA. The proposed architecture allows designers to easily implement a GA as a hardware circuit consisting of parallel pipelines which execute GA operations. The proposed architecture is scalable to increase the number of parallel pipelines. The architecture is applicable to various problems and allows designers to estimate the size of resulting circuits. The authors give a model for predicting the size of resulting circuits from given parameters. Based on the proposed method, the authors have implemented a tool to facilitate GA circuit design and development. Through experiments using knapsack problem and traveling salesman problem (TSP), the authors show that the FPGA circuits synthesized based on the proposed method run much faster and consume much lower power than software implementation on a PC and the model can predict the size of the resulting circuit accurately enough","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"951 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129663204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Hardware Generation of Random Variates with Arbitrary Distributions","authors":"David B. Thomas, W. Luk","doi":"10.1109/FCCM.2006.39","DOIUrl":"https://doi.org/10.1109/FCCM.2006.39","url":null,"abstract":"This paper presents a technique for efficiently generating random numbers from a given probability distribution. This is achieved by using a generic hardware architecture, which transforms uniform random numbers according to a distribution mapping stored in RAM, and a software approximation generator that creates distribution mappings for any given target distribution. This technique has many features not found in current non-uniform random number generators, such as the ability to adjust the target distribution while the generator is running, per-cycle switching between distributions, and the ability to generate distributions with discontinuities in the probability density function","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124311282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Quinn, D. Bhaduri, C. Teuscher, P. Graham, M. Gokhale
{"title":"The STAR-C Truth: Analyzing Reconfigurable Supercomputing Reliability","authors":"H. Quinn, D. Bhaduri, C. Teuscher, P. Graham, M. Gokhale","doi":"10.1109/FCCM.2006.70","DOIUrl":"https://doi.org/10.1109/FCCM.2006.70","url":null,"abstract":"In this abstract, the authors present an overview of a reliability analysis toolset, called the scalable tool for the analysis of reliable systems (STAR systems), with modules for determining the reliability of FPGA designs (STAR-circuits) and reconfigurable supercomputers (STAR-reconfigurable supercomputers","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115404960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware/Software Approach to Molecular Dynamics on Reconfigurable Computers","authors":"R. Scrofano, M. Gokhale, F. Trouw, V. Prasanna","doi":"10.1109/FCCM.2006.46","DOIUrl":"https://doi.org/10.1109/FCCM.2006.46","url":null,"abstract":"With advances in re configurable hardware, especially field-programmable gate arrays (FPGAs), it has become possible to use reconfigurable hardware to accelerate complex applications, such as those in scientific computing. There has been a resulting development of reconfigurable computers - computers which have both general purpose processors and reconfigurable hardware, as well as memory and high-performance interconnection networks. In this paper, we study the acceleration of molecular dynamics simulations using reconfigurable computers. We describe how we partition the application between software and hardware and then model the performance of several alternatives for the task mapped to hardware. We describe an implementation of one of these alternatives on a reconfigurable computer and demonstrate that for two real-world simulations, it achieves a 2 times speed-up over the software baseline. We then compare our design and results to those of prior efforts and explain the advantages of the hardware/software approach, including flexibility","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125397939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pipelined Mixed Precision Algorithms on FPGAs for Fast and Accurate PDE Solvers from Low Precision Components","authors":"R. Strzodka, Dominik Göddeke","doi":"10.1109/FCCM.2006.57","DOIUrl":"https://doi.org/10.1109/FCCM.2006.57","url":null,"abstract":"FPGAs are becoming more and more attractive for high precision scientific computations. One of the main problems in efficient resource utilization is the quadratically growing resource usage of multipliers depending on the operand size. Many research efforts have been devoted to the optimization of individual arithmetic and linear algebra operations. In this paper the authors take a higher level approach and seek to reduce the intermediate computational precision on the algorithmic level by optimizing the accuracy towards the final result of an algorithm. In our case this is the accurate solution of partial differential equations (PDEs). Using the Poisson problem as a typical PDE example the authors show that most intermediate operations can be computed with floats or even smaller formats and only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double precision solver. Thus the FPGA can be configured with many parallel float rather than few resource hungry double operations. To achieve this, the authors adapt the general concept of mixed precision iterative refinement methods to FPGAs and develop a fully pipelined version of the conjugate gradient solver. The authors combine this solver with different iterative refinement schemes and precision combinations to obtain resource efficient mappings of the pipelined algorithm core onto the FPGA","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123480626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multithreaded Soft Processor for SoPC Area Reduction","authors":"B. Fort, D. Capalija, Z. Vranesic, S. Brown","doi":"10.1109/FCCM.2006.10","DOIUrl":"https://doi.org/10.1109/FCCM.2006.10","url":null,"abstract":"The growth in size and performance of field programmable gate arrays (FPGAs) has compelled system-on-a-programmable-chip (SoPC) designers to use soft processors for controlling systems with large numbers of intellectual property (IP) blocks. Soft processors control IP blocks, which are accessed by the processor either as peripheral devices or/and by using custom instructions (CIs). In large systems, chip multiprocessors (CMPs) are used to execute many programs concurrently. When these programs require the use of the same IP blocks which are accessed as peripheral devices, they may have to stall waiting for their turn. In the case of CIs, the FPGA logic blocks that implement the CIs may have to be replicated for each processor. In both of these cases FPGA area is wasted, either by idle soft processors or the replication of CI logic blocks. This paper presents a multithreaded (MT) soft processor for area reduction in SoPC implementations. An MT processor allows multiple programs to access the same IP without the need for the logic replication or the replication of whole processors. We first designed a single-threaded processor that is instruction-set compatible to Altera's Nios II soft processor. Our processor is approximately the same size as the Nios II economy version, with equivalent performance. We augmented our processor to have 4-way interleaved multithreading capabilities. This paper compares the area usage and performance of the MT processor versus two CMP systems, using Altera's and our single-threaded processors, separately. Our results show that we can achieve an area savings of about 45% for the processor itself, in addition to the area savings due to not replicating CI logic blocks","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124426465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining Instruction Coding and Scheduling to Optimize Energy in System-on-FPGA","authors":"R. Dimond, O. Mencer, W. Luk","doi":"10.1109/FCCM.2006.31","DOIUrl":"https://doi.org/10.1109/FCCM.2006.31","url":null,"abstract":"In this paper, we investigate a combination of two techniques n struction coding and instruction re-ordering - for optimizing energy in embedded processor control. We present the first practical, hardware implementation incorporating both approaches as part of a novel flow for automatic power-optimization of an FPGA soft processor. Our infrastructure generates customized processors and associated software, to enable power optimizations to be evaluated on multiple architectures and FPGA platforms. We evaluate using both software estimates of power and actual measurements from both low-cost and high-performance FPGAs. We generate over 150 optimized processor designs for two FPGA platforms, two processor architectures and six different benchmarks at four different clock rates and achieve consistent measured dynamic power reduction of up to 74%, without performance cost. Our results are applicable beyond processor optimization, quantifying the benefits of practical switching reduction and highlighting non-obvious pitfalls and complexities in dynamic power optimization","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124753448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}