Tassadaq Hussain, Oscar Palomar, O. Unsal, A. Cristal, E. Ayguadé, M. Valero
{"title":"MAPC: Memory access pattern based controller","authors":"Tassadaq Hussain, Oscar Palomar, O. Unsal, A. Cristal, E. Ayguadé, M. Valero","doi":"10.1109/FPL.2014.6927397","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927397","url":null,"abstract":"Traditionally, system designers have attempted to improve system performance by scheduling the processing cores and by exploring different memory system configurations and there is comparatively less work done scheduling the accesses at the memory system level and exploring data accesses on the memory systems. In this paper, we propose a memory access pattern based controller (MAPC). MAPC organizes data accesses in descriptors, prioritizes them with respect to the number and size of transfer requests. When compared to the baseline multicore system, the MAPC based system achieves between 2.41× to 5.34× of speedup for different applications, consumes 28% less hardware resources and 13% less dynamic power.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133774800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power-efficient re-gridding architecture for accelerating Non-uniform Fast Fourier Transform","authors":"Umer I. Cheema, G. Nash, R. Ansari, A. Khokhar","doi":"10.1109/FPL.2014.6927451","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927451","url":null,"abstract":"This paper proposes a novel FPGA-based accelerator for the memory and compute-intense re-gridding process used in computation of Non-uniform Fast Fourier Transform (NuFFT). The re-gridding process interpolates arbitrary sampled data onto a uniform grid using an interpolation kernel function. This regridding step is considered one of the most time consuming step in entire NuFFT computation. We propose a memory-efficient technique based on the novel use of customizable hardware components such as FPGA block memory in First-In-First-Out (FIFO) configuration, fill-rate based arbiter, distributed RAM and an array of pipelined single precision floating point multipliers and adders. The proposed architecture exhibits high performance over a wide range of configurations and data-sizes. A speed-up of over 9.6 was achieved when compared with existing FPGA-based technique at a 7 times higher MFLOPS per watt. Compared to GPU based technique, over 6 times higher MFLOPS per watts were achieved.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"170 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132249053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient multi-standard cognitive radios on FPGAs","authors":"T. Pham, Suhaib A. Fahmy, I. Mcloughlin","doi":"10.1109/FPL.2014.6927380","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927380","url":null,"abstract":"Cognitive radios that support multiple standards and modify operation depending on environmental conditions are becoming more important as the demand for higher bandwidth and efficient spectrum use increases. Traditional implementations in custom ASICs cannot support such flexibility, with standards changing at a faster pace, while software baseband implementations fail to achieve the performance required. Hence, FPGAs offer an ideal platform bringing together flexibility, performance, and efficiency. This work explores the possible techniques for designing multi-standard radios on FPGAs, and explores how partial reconfiguration can be leveraged in a way that is amenable for domain experts with minimal FPGA knowledge.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130348965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Run-time accelerator binding for tile-based mixed-grained reconfigurable architectures","authors":"C. Diniz, M. Shafique, S. Bampi, J. Henkel","doi":"10.1109/FPL.2014.6927392","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927392","url":null,"abstract":"Run-time mixed-grained reconfigurable architectures emerged as an efficient solution to deal with the heterogeneous and at-design-time unpredictable nature of advanced applications. Due to interconnection limitations, the reconfigurable elements are grouped into tiles communicating through an on-chip network. State-of-the-art run-time accelerator binding schemes, i.e., mapping the accelerators to elements in the physical reconfigurable array, do not deal with such tile-based architectures. We propose a new scheme for run-time accelerator binding into our tile-based mixed-grained reconfigurable architecture. By means of an advanced video encoding application, we illustrate that our scheme reduces the inter-tile communication overhead by up to 44% (avg. 23%).","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"205 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114755097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The LEAP FPGA operating system","authors":"Kermin Fleming, Michael Adler","doi":"10.1007/978-3-319-26408-0_14","DOIUrl":"https://doi.org/10.1007/978-3-319-26408-0_14","url":null,"abstract":"","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117091675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rapid codesign of a soft vector processor and its compiler","authors":"Matthew Naylor, S. Moore","doi":"10.1109/FPL.2014.6927425","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927425","url":null,"abstract":"Despite a decade of activity in the development of soft vector processors for FPGAs, high-level language support remains thin. We attribute this problem to a design method in which the high-level vector programming interface is only really considered once the processor architecture has been perfected, by which point the designer may be committed to the time-consuming development of a complicated compiler. In this paper, we present the codesign of a soft vector processor and a lightweight compiler, which together lift the level of abstraction for the programmer while allowing a rapid compiler implementation phase.We demonstrate the effectiveness of our approach on a range of applications from digital signal processing, neuroscience, and machine learning.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"12 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122977904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interconnect for commodity FPGA clusters: Standardized or customized?","authors":"A. T. Markettos, P. Fox, S. Moore, A. Moore","doi":"10.1109/FPL.2014.6927472","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927472","url":null,"abstract":"We demonstrate that a small library of customizable interconnect components permits low-area, high-performance, reliable communication tuned to an application, by analogy with the way designers customize their compute. Whilst soft cores for standard protocols (Ethernet, RapidIO, Infiniband, Interlaken) are a boon for FPGA-to-other-system interconnect, we argue that they are inefficient and unnecessary for FPGA-to-FPGA interconnect. Using the example of BlueLink, our lightweight pluggable interconnect library, we describe how to construct reliable FPGA clusters from hundreds of lower-cost commodity FPGA boards. Utilizing the increasing number of serial links on FPGAs demands efficient use of soft-logic, making domain-optimized custom interconnect attractive for some time to come.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124796541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trade-offs and progressive adoption of FPGA acceleration in network traffic monitoring","authors":"Lukás Kekely, V. Pus, Pavel Benácek, J. Korenek","doi":"10.1109/FPL.2014.6927443","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927443","url":null,"abstract":"Current hardware acceleration cores for network traffic processing are often well optimized for one particular task and therefore provide high level of hardware acceleration. But for many applications, such as network traffic monitoring and security, it is also necessary to achieve rapid development cycle to provide fast response to security threats.We propose and evaluate a new concept of hardware acceleration for flexible flow-based network traffic monitoring with support of application protocol analysis. The concept is called Software Defined Monitoring (SDM) and it relies on a configurable hardware accelerator implemented in FPGA, coupled with smart monitoring tasks running as software on general CPU. The monitoring tasks in the software control the level of detail and type of information retained during the hardware processing. This arrangement allows rapid application prototyping in the software, followed by further shifting of the timing critical parts of the processing to the hardware accelerator. The concept is proposed with the scalability in mind, therefore it is suitable for different FPGA based platforms ranging from embedded single-chip solutions (such as Zynq or CycloneV) to high-speed backbone network monitoring boxes. Our pilot high-speed implementation using FPGA acceleration board in a commodity server performs a 100Gb/s flow traffic measurement augmented by a selected application protocol analysis.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128906470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A combination of multi-edge coding and independent coding lines for time-to-digital conversion","authors":"Dominik Sondej, R. Szplet","doi":"10.1109/FPL.2014.6927382","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927382","url":null,"abstract":"The paper describes a new method for time-to-digital conversion that allows achieving the conversion resolution far below the propagation time of the fastest delay buffer in integrated circuit (IC). The method is a combination of the multi-edge time coding and time digitization in independent coding lines. The implementation of such combination and assessment of its effectiveness are the main aims of this research. The article also describes the main design issues that were solved during the implementation of method in an FPGA device. They include: the generation of a pattern square signal with a certain amount of edges and possibly minimal delays between them, the elimination of bubble errors and reduction of internal interferences in IC.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122629311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable parallel architecture for singular value decomposition of large matrices","authors":"Unai Martinez-Corral, Koldo Basterretxea, Raul Finker","doi":"10.1109/FPL.2014.6927393","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927393","url":null,"abstract":"Singular Value Decomposition (SVD) is a key linear algebraic operation in many scientific and engineering applications, many of them involving high dimensionality datasets and real-time response. In this paper we describe a scalable parallel processing architecture for accelerating the SVD of large m × n matrices. Based on a linear array of simple processing-units (PUs), the proposed architecture follows a double data-flow paradigm (FIFO memories and a shared-bus) for optimizing the time spent in data transferences. The PUs, which perform elemental column-pair evaluations and rotations, have been designed for an efficient utilization of available FPGA resources and to achieve maximum algorithm speed-ups. The architecture is fully scalable from a two-PU scheme to an arrangement with as many as n/2 PUs. This allows for a trade-off between occupied area and processing acceleration in the final implementation, and permits the SVD processor to be implemented both on low-cost and high-end FPGAs. The system has been prototyped on Spartan-6 and Kintex-7 devices for performance comparison.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"257 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122292578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}