{"title":"Three-stage pipeline implementation for SHA2 using data forwarding","authors":"Anh-Tuan Hoang, K. Yamazaki, S. Oyanagi","doi":"10.1109/FPL.2008.4629903","DOIUrl":"https://doi.org/10.1109/FPL.2008.4629903","url":null,"abstract":"The Secure Hash Algorithm 512 (SHA-512), which is used to verify the integrity of a message, involves computation iterations on data. The huge computation delay generated in that iteration limits the entire throughput of the system, and makes it difficult to pipeline the computation. To shorten the computation time in an iteration of the main loop, we used the data forwarding method. Here we introduce an architecture that simultaneously does data computation of an iteration and data movement of the next one. Then the computations are broken into two stages for one operand and three stages for another operand. The implementation occupies 1,520 hardware slices on Xilinx Virtex-4 family FPGA chip, and achieves nearly 2.2 Gbps. Thus, the implementation achieved a better area performance rate (throughput/area) in comparison with the related work.","PeriodicalId":137963,"journal":{"name":"2008 International Conference on Field Programmable Logic and Applications","volume":"203 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115905719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A rate-based prefiltering approach to blast acceleration","authors":"Panagiotis Afratis, E. Sotiriades, Grigorios Chrysos, Sotiria Fytraki, D. Pnevmatikatos","doi":"10.1109/FPL.2008.4630026","DOIUrl":"https://doi.org/10.1109/FPL.2008.4630026","url":null,"abstract":"DNA sequence comparison and database search have evolved in the last years as a field of strong competition between several reconfigurable hardware computing groups. In this paper we present a BLAST preprocessor that efficiently marks the parts of the database that may produce matches. Our prefiltering approach offers significant reduction in the size of the database that needs to be fully processed by BLAST, with a corresponding reduction in the run-time of the algorithm. We have implemented our architecture, evaluated its effectiveness for a variety of databases and queries, and compared its accuracy against the original NCBI Blast implementation. We have found that prefiltering offers at least a factor of 5 and up to 3 orders of magnitude reduction in the database space that needs to be fully searched. Due to its prefiltering nature, our approach can be combined with all major reconfigurable acceleration architectures that have been presented up to date.","PeriodicalId":137963,"journal":{"name":"2008 International Conference on Field Programmable Logic and Applications","volume":"s3-41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130175003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient FPGA mapping of Gilbert’s algorithm for SVM training on large-scale classification problems","authors":"Markos Papadonikolakis, C. Bouganis","doi":"10.1109/FPL.2008.4629968","DOIUrl":"https://doi.org/10.1109/FPL.2008.4629968","url":null,"abstract":"Support vector machines (SVMs) are an effective, adaptable and widely used method for supervised classification. However, training an SVM classifier on large-scale problems is proven to be a very time-consuming task for software implementations. This paper presents a scalable high-performance FPGA architecture of Gilbert’s Algorithm on SVM, which maximally utilizes the features of an FPGA device to accelerate the SVM training task for large-scale problems. Initial comparisons of the proposed architecture to the software approach of the algorithm show a speed-up factor range of three orders of magnitude for the SVM training time, regarding a wide range of data’s characteristics.","PeriodicalId":137963,"journal":{"name":"2008 International Conference on Field Programmable Logic and Applications","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134174563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA architecture for the Pagerank eigenvector problem","authors":"Séamas McGettrick, D. Geraghty, Ciarán McElroy","doi":"10.1109/FPL.2008.4629999","DOIUrl":"https://doi.org/10.1109/FPL.2008.4629999","url":null,"abstract":"Google’s PageRank (PR) eigenvector problem is the world’s largest matrix calculation. The algorithm is dominated by Sparse Matrix by Vector Multiplication (SMVM) where the matrix is very sparse, unsymmetrical and unstructured. The computation presents a serious challenge to general-purpose processors (GPP) and the result is a very lengthy computation time. In this paper, we present an architecture for solving the PR eigenvalue problem on the Virtex 5 FPGA. The architecture is optimised to take advantage of the unique features of the PR algorithm and FPGA technology. Performance benchmarks are presented for a selection of real Internet link matrices. Finally these results are compared with equivalent GPP implementations of the PR algorithm.","PeriodicalId":137963,"journal":{"name":"2008 International Conference on Field Programmable Logic and Applications","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130719089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Metawire: Using FPGA configuration circuitry to emulate a Network-on-Chip","authors":"M. Shelburne, C. Patterson, P. Athanas, Mark T. Jones, B. Martin, Ryan Fong","doi":"10.1049/iet-cdt.2009.0009","DOIUrl":"https://doi.org/10.1049/iet-cdt.2009.0009","url":null,"abstract":"While there have been many reported implementations of networks-on-chip (NoCs) on FPGAs, they have not seen the same acceptance as NoCs on ASICs. One reason is that communication on an FPGA is already costly due to the die resources and time delays inherent in the reconfigurable structure. Layering another general-purpose network on top of the reconfigurable network simply incurs too many performance penalties. There is, however, already a largely unused, global network available in FPGAs. As a proof-of-concept, we demonstrate that the Xilinx FPGA configuration circuitry, which is normally idle during system operation, can function as a relatively high-performance NoC. MetaWire performs transfers through an overclocked Virtex-4 internal configuration access port (ICAP) and is shown to provide a bandwidth exceeding 200 MBytes/sec.","PeriodicalId":137963,"journal":{"name":"2008 International Conference on Field Programmable Logic and Applications","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133403289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multi-platform controller allowing for maximum Dynamic Partial Reconfiguration throughput","authors":"Christopher Claus, Bin Zhang, W. Stechele, L. Braun, M. Hübner, J. Becker","doi":"10.1109/FPL.2008.4630002","DOIUrl":"https://doi.org/10.1109/FPL.2008.4630002","url":null,"abstract":"Dynamic and partial reconfiguration (DPR) is a special feature offered by Xilinx Field Programmable Gate Arrays (FPGAs), giving the designer the ability to reconfigure a certain portion of the FPGA during run-time without influencing the other parts. This feature allows the hardware to be adaptable to any potential situation. For some applications, such as video-based driver assistance, the time needed to exchange a certain portion of the device might be critical. This paper addresses problems, limitations and results of on-chip reconfiguration that enable the user to decide whether DPR is suitable for a certain design prior to its implementation. A method is therefore introduced to calculate the expected reconfiguration throughput and latency. In addition, an IP core is presented that enables fast on-chip DPR close to the maximum achievable speed. Compared to an alternative state-of-the art realization, an increase in speed by a factor of 58 can be obtained.","PeriodicalId":137963,"journal":{"name":"2008 International Conference on Field Programmable Logic and Applications","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114771360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bitstream compression techniques for Virtex 4 FPGAs","authors":"R. Stefan, S. Cotofana","doi":"10.1109/FPL.2008.4629952","DOIUrl":"https://doi.org/10.1109/FPL.2008.4629952","url":null,"abstract":"This paper examines the opportunity of using compression for accelerating the (re)configuration of FPGA devices, focusing on the choice of compression algorithms, and their hardware implementation cost. As our purpose is the acceleration of the configuration process, estimating the decoder speed also plays a major role in our study. We evaluate a wide range of well-established compression algorithms and we also propose two methods specifically developed for compressing FPGA configuration bitstreams, one based on a static dictionary and the other on arithmetic coding. For the arithmetic coding we propose a statistical model that takes advantage of the particularities of the configuration bitstreams of the Virtex 4 FPGA family. We evaluate the efficiency of the proposed methods along with state of the art compression algorithms on a number of benchmark circuits, some selected from the available open source implementations and some synthetically generated. Our evaluations indicate that using modest resources we can achieve parity and even exceed commercial software in terms of compression ratio, and outperform all other traditional algorithms. All our implemented decompressors are shown to use less than 1.5% of the slices available on the FPGA device.","PeriodicalId":137963,"journal":{"name":"2008 International Conference on Field Programmable Logic and Applications","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114973436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards benchmarking energy efficiency of reconfigurable architectures","authors":"Tobias Becker, P. Jamieson, W. Luk, P. Cheung, T. Rissa","doi":"10.1109/FPL.2008.4630041","DOIUrl":"https://doi.org/10.1109/FPL.2008.4630041","url":null,"abstract":"Energy research in reconfigurable architectures often involves legacy benchmarks such as the MCNC benchmarks. These benchmarks, however, are not well-suited for assessing energy consumption of reconfigurable technology, since they lack realistic input stimuli. This paper reviews and categorises a range of computation system benchmarks, and shows that there are no comprehensive benchmarks targeting reconfigurable architectures that would stimulate energy or power research. We review existing energy research in the field which involves microbenchmarks, in-house designs, or legacy benchmark suites used to evaluate power optimisations.","PeriodicalId":137963,"journal":{"name":"2008 International Conference on Field Programmable Logic and Applications","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116371162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mapping and scheduling with task clustering for heterogeneous computing systems","authors":"Y. Lam, J. Coutinho, W. Luk, P. Leong","doi":"10.1109/FPL.2008.4629944","DOIUrl":"https://doi.org/10.1109/FPL.2008.4629944","url":null,"abstract":"This paper presents a new approach for mapping task graphs to heterogeneous hardware/software computing systems using heuristic search techniques. Two techniques: (1) integration of clustering, mapping, and scheduling in a single step and (2) multiple neighborhood functions strategy are proposed to enhance quality of mapping/scheduling solutions. Our approach is demonstrated by case studies involving 40 randomly generated task graphs, as well as four real applications including signal processing and pattern recognition. Experimental results show that the proposed integrated approach outperforms a separate approach in terms of quality of the mapping/scheduling solution by up to 18.3% for a heterogeneous system which includes a microprocessor, a floating-point digital signal processor, and an FPGA.","PeriodicalId":137963,"journal":{"name":"2008 International Conference on Field Programmable Logic and Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128442346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A dynamic temperature control simulation system for FPGAs","authors":"Shilpa Bhoj, D. Bhatia","doi":"10.1109/FPL.2008.4630033","DOIUrl":"https://doi.org/10.1109/FPL.2008.4630033","url":null,"abstract":"Rapid increases in transistor density, clock speeds and competition with custom ICs have escalated the demand for aggressive solutions to battle rising operating temperatures in programmable fabrics. In this work, we make several key contributions to temperature management in FPGAs. We develop a novel and robust simulation framework exploring adaptive techniques to reduce on chip temperatures in the reconfigurable core. We implement a thermal driven voltage scaling algorithm based on temperature and performance feedback. Our performance estimation model is an accurate empirical relation between delay, supply voltage and temperature with an average error of 9%. Our final results show significant temperature reductions of up to 13.37°C accompanied by the added benefit of power savings averaging 13.48%. Overheads are limited to an average reduction in worst case operating frequency of 10.78% and a voltage swing of 0.61V.","PeriodicalId":137963,"journal":{"name":"2008 International Conference on Field Programmable Logic and Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128690750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}