{"title":"An FPGA hardware acceleration of the indirect calculation of tree lengths method for phylogenetic tree reconstruction","authors":"Henry Block, T. Maruyama","doi":"10.1109/FPL.2014.6927430","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927430","url":null,"abstract":"In this work, we present an FPGA hardware implementation for a phylogenetic tree reconstruction with maximum parsimony algorithm. We base our approach on a particular stochastic local search algorithm that uses the Indirect Calculation of Tree Lengths method and the Progressive Neighborhood. In our implementation, we define a tree structure, and accelerate the search by parallel and pipeline processing. We show results for six real-world biological datasets. We compare execution times against our previous hardware approach, and TNT, the fastest available parsimony program. Acceleration rates between 34 to 45 per rearrangement, and 2 to 6, for the whole search, are obtained against our previous approach. Acceleration rates between 2 to 4 per rearrangement, and 18 to 112, for the whole search, are obtained against TNT. We estimate that these acceleration rates could increase for even larger datasets.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126947514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A bit-interleaved embedded hamming scheme to correct single-bit and multi-bit upsets for SRAM-based FPGAs","authors":"Shyamsundar Venkataraman, Rui Santos, Anup Das, Akash Kumar","doi":"10.1109/FPL.2014.6927385","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927385","url":null,"abstract":"Single Event Upsets (SEUs) inadvertently change the configuration bits of Static-RAM (SRAM)-based Field Programmable Gate Arrays (FPGAs), leading to erroneous output until the error has been corrected. Scrubbing using an Error Correction Code (ECC) such as hamming is a popular method to correct such faults. However, current works either require a large external memory to store the ECCs or can at most correct only one error in a frame. This paper proposes a novel bit-interleaved embedded hamming scheme along with scrubbing, to correct single (SBUs) and multi-bit upsets (MBUs) in SRAM-based FPGAs. This scheme does not require an external memory to store the ECCs, as they are embedded within the configuration memory itself. Experiments conducted on various benchmarks show that the proposed scheme can handle multiple errors per frame very well, with an embedding efficiency of over 99.3%.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126876423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic high-level synthesis of multi-threaded hardware accelerators","authors":"Jens Huthmann, J. Oppermann, A. Koch","doi":"10.1109/FPL.2014.6927432","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927432","url":null,"abstract":"We describe extending the hardware/software co-compiler Nymble to automatically generate multi-threaded (SIMT) hardware accelerators. In contrast to prior work that simply duplicated complete compute units for each thread, Nymble-MT reuses the actual computation elements, and adds just the required data storage and context switching logic. On the CHStone benchmark suite and a sample configuration of four threads, the prototype can up to quadruple the throughput, but with a chip area just 5% larger than that of a single-threaded accelerator.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127903755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A scalable, serially-equivalent, high-quality parallel placement methodology suitable for modern multicore and GPU architectures","authors":"C. Fobel, G. Grewal, D. Stacey","doi":"10.1109/FPL.2014.6927481","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927481","url":null,"abstract":"Placement and routing run-times continue to dominate the automated FPGA design flow. As the size of FPGA architectures continue to grow exponentially, it remains critical to develop parallel tools for FPGA design where the amount of exposed concurrent work scales with the size of the designs to be synthesized. In this paper, we propose a novel algorithm for parallel placement, based on simulated annealing, where the amount of parallel work directly scales with the size of the net-list to be placed. Our approach concurrently evaluates and conditionally applies very large sets of non-conflicting swaps using common parallel computing primitives, including stream compaction, category reduction, and sort. While our design is suitable for targeting all modern parallel computing platforms, we present results from our implementation which targets NVIDIA's CUDA platform, where we achieve a mean speed-up of 19x over VPR with post-routing critical-path-delay and wire-length quality that matches or exceeds VPR. We believe that this work is an important step towards the development of a scalable, high-quality placement tool.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116550733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HyPER: A runtime reconfigurable architecture for monte carlo option pricing in the Heston model","authors":"Christian Brugger, C. D. Schryver, N. Wehn","doi":"10.1109/FPL.2014.6927458","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927458","url":null,"abstract":"High-speed and energy-efficient computations are mandatory in the financial and insurance industry to survive in competition and meet the federal reporting requirements. On a hybrid CPU/FPGA system we propose a modular pricing engine and derive a novel algorithmic extension able to exploit online dynamic reconfiguration. The result is a high-performance and energy-efficient pricing system suitable for exotic option pricing in the state-of-the-art Heston market model. With the online reconfiguration extension our hybrid pricing system is nearly two orders of magnitude faster than high-end Intel CPUs, while consuming the same power.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"149 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116574663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Syed M. A. H. Jafri, G. Serrano, M. Daneshtalab, Naeem Abbas, A. Hemani, K. Paul, J. Plosila, H. Tenhunen
{"title":"TransPar: Transformation based dynamic Parallelism for low power CGRAs","authors":"Syed M. A. H. Jafri, G. Serrano, M. Daneshtalab, Naeem Abbas, A. Hemani, K. Paul, J. Plosila, H. Tenhunen","doi":"10.1109/FPL.2014.6927485","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927485","url":null,"abstract":"Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern applications (e.g. 4G, CDMA, etc.). Recently proposed CGRAs offer runtime parallelism to reduce energy consumption (by lowering voltage/frequency). To implement the runtime parallelism, CGRAs commonly store multiple compile-time generated implementations of an application (with different degree of parallelism) and select the optimal version at runtime. However, the compile-time binding incurs excessive configuration memory overheads and/or is unable to parallelize an application even when sufficient resources are available. As a solution to this problem, we propose Transformation based dynamic Parallelism (TransPar). TransPar stores only a single implementation and applies a series for transformations to generate the bitstream for the parallel version. In addition, it also allows to displace and/or rotate an application to parallelize in resource constrained scenarios. By storing only a single implementation, TransPar offers significant reductions in configuration memory requirements (up to 73% for the tested applications), compared to state of the art compaction techniques. Simulation and synthesis results, using real applications, reveal that the additional flexibility allows up to 33% energy reduction compared to static memory based parallelism techniques. Gate level analysis reveals that TransPar incurs negligible silicon (0.2% of the platform) and timing (6 additional cycles per application) penalty.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134579507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-throughput implementation of a million-point sparse Fourier Transform","authors":"Abhinav Agarwal, Haitham Hassanieh, Omid Salehi-Abari, Ezzeldin Hamed, D. Katabi, Arvind","doi":"10.1109/FPL.2014.6927450","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927450","url":null,"abstract":"The emergence of data-intensive problems in areas like computational biology, astronomy, medical imaging, etc. has emphasized the need for fast and efficient very large Fourier Transforms. Recent work has shown that we can compute million-point transforms efficiently provided the data is sparse in the frequency domain. Processing input samples at rates approaching 1 GHz would allow real-time processing in several such applications. In this paper, we present a high-throughput FPGA implementation that performs a million-point sparse Fourier Transform on frequency-sparse input data, generating the largest 500 frequency component locations and values every 1.16 milliseconds. This design can process streamed input data at 0.86 Giga samples per second, and does not make any assumptions of the distribution of the frequency components beyond sparsity.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121063275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Ahari, Behnam Khaleghi, Zahra Ebrahimi, H. Asadi, M. Tahoori
{"title":"Towards dark silicon era in FPGAs using complementary hard logic design","authors":"A. Ahari, Behnam Khaleghi, Zahra Ebrahimi, H. Asadi, M. Tahoori","doi":"10.1109/FPL.2014.6927504","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927504","url":null,"abstract":"While the transistor density continues to grow exponentially in Field-Programmable Gate Arrays (FPGAs), the increased leakage current of CMOS transistors act as a power wall for the aggressive integration of transistors in a single die. One recently trend to alleviate the power wall in FPGAs is to turn off inactive regions of the silicon die, referred to as dark silicon. This paper presents a reconfigurable architecture to enable effective fine-grained power gating of unused Logic Blocks (LBs) in FPGAs. In the proposed architecture, the traditional soft logic is replaced with Mega Cells (MCs), each consists of a set of complementary Generic Reconfigurable Hard Logic (GRHL) and a conventional Look-Up Table (LUT). Both GRHL cells and LUTs can be power gated and turned off by controlling configuration bits. In the proposed MC, only one cell is active and the others are turned off. Experimental results on MCNC benchmark suite reveal that the proposed architecture reduces the critical path delay, power, and Power Delay Product (PDP) of LBs up to 5.3%, 30.4%, and 28.8% as compared to the equivalent LUT-based architecture.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128417714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PR-HMPSoC: A versatile partially reconfigurable heterogeneous Multiprocessor System-on-Chip for dynamic FPGA-based embedded systems","authors":"T. D. A. Nguyen, Akash Kumar","doi":"10.1109/FPL.2014.6927492","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927492","url":null,"abstract":"FPGA-based heterogeneous Multiprocessor Systems-on-Chip (HMPSoCs) are becoming quite popular for high performance embedded systems because of their powerful computational ability and relatively flexible architecture to adapt to unexpected system requirement changes. However, with the insatiable demands of supporting an extensive range of applications beyond the limited resources of FPGA chip and shorter time-to-market, many research works on partially reconfigurable (PR) FPGA architectures have been conducted to fulfill the needs. Those have yet to fully provide a versatile framework to exploit the flexibility of PR such as hardware/software task migration and bitstream relocation; more importantly, the on-chip debug features to access all processors currently loaded in the system are compromised because of the lack of native-support from vendor tools. In this paper, a novel PR-HMPSoC architecture for dynamic FPGA-based embedded system is proposed to provide solutions for all of the above issues. The results from the experimental system consisting of one static Microblaze and three PR Microblaze/hardware accelerators connected by a Network-on-Chip show that the architecture is very promising with just 8% reduction in operating frequency.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"211 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114334173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Garzia, A. Rügamer, R. Koch, P. Neumaier, E. Serezhkina, M. Overbeck, G. Rohmer
{"title":"Experimental multi-FPGA GNSS receiver platform","authors":"F. Garzia, A. Rügamer, R. Koch, P. Neumaier, E. Serezhkina, M. Overbeck, G. Rohmer","doi":"10.1109/FPL.2014.6927399","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927399","url":null,"abstract":"This paper describes the system architecture and implementation results of a robust and flexible dual-frequency 2×2 array processing GNSS receiver platform. A digital front-end FPGA pre-processes the incoming raw ADC data and implements interference mitigation methods in time and frequency domain. An optional second FPGA card can be used to realize more sophisticated and computational complex interference mitigation techniques. Finally, the data stream is processed on a baseband FPGA platform with spatial array processing techniques using a software assisted hardware GNSS receiver approach. The interconnection of the FPGAs is realized using gigabit transceivers handling a constant raw data rate of 16.8 Gbit/s.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127341986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}