{"title":"ASC-Based Acceleration in an FPGA with a Processor Core Using Software-Only Skills","authors":"Evgeny Fiksman, Y. Birk, O. Mencer","doi":"10.1109/FCCM.2006.26","DOIUrl":"https://doi.org/10.1109/FCCM.2006.26","url":null,"abstract":"A stream compiler (ASC) generates net lists for hardware (FPGA) accelerators from C-like descriptions, obviating the need for hardware skills. The authors present a backend adapter that enables integration of such accelerators with a processor core in the same FPGA. Development of hybrid ASC-accelerated applications using software-only skills is thus made possible, as illustrated by a hybrid power-conscious iDCT implementation","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122632519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application-Specific Memory Interleaving Enables High Performance in FPGA-based Grid Computations","authors":"T. Court, M. Herbordt","doi":"10.1109/FCCM.2006.25","DOIUrl":"https://doi.org/10.1109/FCCM.2006.25","url":null,"abstract":"Current generations of FPGAs create possibilities for innovative, application-specific computation pipelines. In many cases, the pipeline can fully exploit the FPGA's parallelism only when multiple operands are available concurrently, requiring clusters of values to be fetched from memory. These clusters of values often have fixed organization, as in the eight grid points around an off-grid position that are needed for 3D interpolation of a value at that position. This paper presents a technique for creating custom interleaving of the FPGA's on-chip memories, giving access to the entire cluster of values in one memory cycle. This technique works on grids of 2, 3, or more dimensions, on many non-rectangular grids, and on cluster organization specific to each application. The authors report the initial version of a design tool that inputs the relative positions of grid points in the access cluster, and produces synthesizable HDL code for the custom-interleaved memory","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127483616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scheduling divisible loads on partially reconfigurable hardware","authors":"K. Vikram, V. Vasudevan","doi":"10.1109/FCCM.2006.63","DOIUrl":"https://doi.org/10.1109/FCCM.2006.63","url":null,"abstract":"For a task mapped to the reconfigurable fabric (RF) of partially reconfigurable hybrid processor architecture, significant speedup can be obtained if multiple processing units (PUs) are used to accelerate the task. In this paper, the authors present the results obtained from a quantitative analysis for a single data-parallel task mapped to the RF of bus-based hybrid processor architecture. The architectural constraints in this case include run-time reconfiguration delay and a shared data bus to main memory","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"386 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126245769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware/Software Co-Design Architecture for Lattice Decoding Algorithms","authors":"C. Liang, Jing Ma, Xinming Huang","doi":"10.1109/FCCM.2006.47","DOIUrl":"https://doi.org/10.1109/FCCM.2006.47","url":null,"abstract":"This paper presents hardware/software co-design architecture targeted on a single FPGA for two typical lattice decoding algorithms in MIMO system. Two levels of parallelisms are analyzed for an efficient implementation with the preprocessing part on embedded MicroBlaze soft processor and the decoder part on customized hardware. The system prototypes of the AV and VB decoders show that they support up to 34.2 Mbps and 3.15 Mbps data rate respectively on XUP Virtex-II pro developing board, which are 19 and 16 times faster than their respective implementations on a DSP","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122392421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Architecture for Efficient Hardware Data Mining using Reconfigurable Computing Systems","authors":"Z. Baker, V. Prasanna","doi":"10.1109/FCCM.2006.22","DOIUrl":"https://doi.org/10.1109/FCCM.2006.22","url":null,"abstract":"The Apriori algorithm is a fundamental correlation-based data mining kernel used in a variety of fields. The innovation in this paper is a highly parallel custom architecture implemented on a reconfigurable computing system. Using this \"bitmapped CAM,\" the time and area required for executing the subset operations fundamental to data mining can be significantly reduced. The bitmapped CAM architecture implementation on an FPGA-accelerated high performance workstation provides a performance acceleration of orders of magnitude over software-based systems. The bitmapped CAM utilizes redundancy within the candidate data to efficiently store and process many subset operations simultaneously. The efficiency of this operation allows 140 units to process about 2,240 subset operations simultaneously. Using industry-standard benchmarking databases, we have tested the bitmapped CAM architecture and shown the platform provides a minimum of 24times (and often much higher) time performance advantage over the fastest software Apriori implementations","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128269234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effects of High-Level Discrete Signal Transform Formulations on Partitioning for Multi-FPGA Architectures","authors":"R. Arce-Nazario, M. Jimenez, D. Rodríguez","doi":"10.1109/FCCM.2006.38","DOIUrl":"https://doi.org/10.1109/FCCM.2006.38","url":null,"abstract":"The achievement of effective implementations to multi-FPGA architectures is greatly dependent on the process of partitioning. Although several automated high-level partitioning (HLP) methods have been reported (Srinivasan, et al., 2001) most of them are designed to solve general partitioning problems, and tend to apply generic local optimization techniques that miss out on alternate formulations that become apparent only with knowledge of the algorithm's functionality. The algorithmic formulation of discrete signal transforms (DST) especially that of the DFT has been extensively studied. Automated computational algebra platforms for the algorithmic manipulation of fast transform algorithms have been proposed, as well as automated methods to optimize DST implementations to general purpose processor platforms (Puschel et al., 2001) However, these methods have yet to be successfully adapted to automated partitioning methodologies for dedicated distributed hardware platforms","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133752634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nikolaos Bellas, S. Chai, Malcolm Dwyer, D. Linzmeier
{"title":"Template-Based Generation of Streaming Accelators from a High Level Presentation","authors":"Nikolaos Bellas, S. Chai, Malcolm Dwyer, D. Linzmeier","doi":"10.1109/FCCM.2006.69","DOIUrl":"https://doi.org/10.1109/FCCM.2006.69","url":null,"abstract":"A design methodology and prototype tool to automate the design and architectural exploration of hardware accelerators are described in this paper. In comparison to other approaches, we utilize a well-engineered template to enable fast convergence to an area and speed efficient design. We show how this methodology is used for an application set with various architectural configurations","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132296175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. H. Ho, P. Leong, W. Luk, S. Wilton, S. López-Buedo
{"title":"Virtual Embedded Blocks: A Methodology for Evaluating Embedded Elements in FPGAs","authors":"C. H. Ho, P. Leong, W. Luk, S. Wilton, S. López-Buedo","doi":"10.1109/FCCM.2006.71","DOIUrl":"https://doi.org/10.1109/FCCM.2006.71","url":null,"abstract":"Embedded elements, such as block multipliers, are increasingly used in advanced field programmable gate array (FPGA) devices to improve efficiency in speed, area and power consumption. A methodology is described for assessing the impact of such embedded elements on efficiency. The methodology involves creating dummy elements, called virtual embedded blocks (VEBs), in the FPGA to model the size, position and delay of the embedded elements. The standard design flow offered by FPGA and CAD vendors can be used for mapping, placement, routing and retiming of designs with VEBs. The speed and resource utilisation of the resulting designs can then be inferred using the FPGA vendor's timing analysis tools. We illustrate the application of this methodology to the evaluation of various schemes of involving embedded elements that support floating-point computations","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114534293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Herbordt, Josh Model, Y. Gu, Bharat Sukhwani, T. Court
{"title":"Single Pass, BLAST-Like, Approximate String Matching on FPGAs","authors":"M. Herbordt, Josh Model, Y. Gu, Bharat Sukhwani, T. Court","doi":"10.1109/FCCM.2006.64","DOIUrl":"https://doi.org/10.1109/FCCM.2006.64","url":null,"abstract":"Approximate string matching is fundamental to bioinformatics, and has been the subject of numerous FPGA acceleration studies. We address issues with respect to FPGA implementations of both BLAST- and dynamic programming- (DP) based methods. Our primary contributions are two new algorithms for emulating the seeding and extension phases of BLAST. These operate in a single pass through a database at streaming rate (110 Maa/sec on a VP70 for query sizes up to 600 and 170 Maa/sec on a Virtex4 for query sizes up to 1024), and with no preprocessing other than loading the query string. Further, they use very high sensitivity with no slowdown. While current DP-based methods also operate at streaming rate, generating results can be cumbersome. We address this with a new structure for data extraction. We present results from several implementations","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116547479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer","authors":"G. R. Morris, V. Prasanna, Richard D. Anderson","doi":"10.1109/FCCM.2006.8","DOIUrl":"https://doi.org/10.1109/FCCM.2006.8","url":null,"abstract":"Supercomputer companies such as Cray, Silicon Graphics, and SRC Computers now offer reconfigurable computer (RC) systems that combine general-purpose processors (GPPs) with field-programmable gate arrays (FPGAs). The FPGAs can be programmed to become, in effect, application-specific processors. These exciting supercomputers allow end-users to create custom computing architectures aimed at the computationally intensive parts of each problem. This report describes a parameterized, parallelized, deeply pipelined, dual-FPGA, IEEE-754 64-bit floating-point design for accelerating the conjugate gradient (CG) iterative method on an FPGA-augmented RC. The FPGA-based elements are developed via a hybrid approach that uses a high-level language (HLL)-to-hardware description language (HDL) compiler in conjunction with custom-built, VHDL-based, floating-point components. A reference version of the design is implemented on a contemporary RC. Actual run time performance data compare the FPGA-augmented CG to the software-only version and show that the FPGA-based version runs 1.3 times faster than the software version. Estimates show that the design can achieve a 4 fold speedup on a next-generation RC","PeriodicalId":123057,"journal":{"name":"2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129163375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}