{"title":"Bambu: A modular framework for the high level synthesis of memory-intensive applications","authors":"C. Pilato, Fabrizio Ferrandi","doi":"10.1109/FPL.2013.6645550","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645550","url":null,"abstract":"This paper presents bambu, a modular framework for research on high-level synthesis currently under development at Politecnico di Milano. It can accept most of C constructs without requiring any three-state for their implementations by exploiting a novel and efficient memory architecture. It also allows the integration of floating-point units and thus it can deal with a wide range of data types. Finally, it allows to easily customize the synthesis flow (e.g., transformation passes, constraints, options, synthesis scripts) through an XML file and it automatically generates test-benches and validates the results against the corresponding software execution, supporting both ASIC and FPGA technologies.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133264822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Demonstration of a heterogeneous multi-core processor with 3-D inductive coupling links","authors":"Yusuke Koizumi, N. Miura, Yasuhiro Take, Hiroki Matsutani, T. Kuroda, H. Amano, Ryuichi Sakamoto, M. Namiki, K. Usami, Masaaki Kondo, Hiroshi Nakamura","doi":"10.1109/FPL.2013.6645628","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645628","url":null,"abstract":"Cube-1 is a heterogeneous multi-core processor which can achieve the required performance with the least energy consumption as possible. It can control the performance and energy with two levels: (1) the number of accelerators can be easily changed by increasing or decreasing the number of stacked chips after fabrication, as they are connected with inductive coupling links. (2) The supply voltage for PE array of the accelerator can be controlled by the host CPU so that the required performance can be obtained with a minimum supply voltage.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132758107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generation of multi-core systems from multithreaded software","authors":"Alexander Wold, J. Tørresen, Andreas Agne","doi":"10.1109/FPL.2013.6645582","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645582","url":null,"abstract":"A heterogeneous system with soft CPU tailored to the individual threads of the application, while still software based, offers the potential for improved performance and resource utilization over a homogeneous system. In this paper we present a method to automatically create a heterogeneous multi-core system from a multithreaded software application. The resulting system consists of processing elements based on customized MIPS soft CPUs coupled with their respective programs. Using instruction set architecture (ISA) subsetting, we adapt the individual soft CPUs to the specific computations they have to perform. We have carried out a case study with a constraint solver application for which we find a performance increase of 1.54 accompanied with an area reduction of 22.5% compared to a homogeneous multi-core system. We also present an automated toolchain that generates synthesizable IP-cores from software threads with little additional development overhead.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"206 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115560086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated synthesis of FPGA-based heterogeneous interconnect topologies","authors":"A. Cilardo, E. Fusella, L. Gallo, A. Mazzeo","doi":"10.1109/FPL.2013.6645494","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645494","url":null,"abstract":"The choice of the communication topology in many systems is of vital importance because it affects the entire inter-component data traffic and impacts significantly the overall system performance and cost. On the other hand, there is a very large spectrum of interconnection topologies that potentially meet given communication requirements, determining various trade-offs between cost and performance. This work proposes an automated methodology to choose among all of these possibilities, avoiding a manual and time consuming design space search process. The methodology takes as input the description of the application communication requirements, and gives as output an on-chip synthesizable interconnection structure satisfying given area constraints. Targeted at FPGA technologies, the approach generates an interconnection structure combining crossbars and shared buses, connected through bridges, yielding a scalable, efficient structure. To the best of the authors' knowledge, it provides the first method to automatically generate FPGA-based communication architectures where heterogeneous communication elements, such as shared buses and crossbar switches, coexist in a network inherently supporting multiple communication paths. The resulting architecture improves the level of communication parallelism that can be exploited, while keeping area requirements low. The paper thoroughly describes the formalisms and the methodology used to derive such optimized heterogeneous topologies. It also discusses a couple of case-study applications emphasizing the impact of the proposed approach and highlighting the essential differences with a few other solutions in the literature.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121359436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating solvers for global atmospheric equations through mixed-precision data flow engine","authors":"L. Gan, H. Fu, W. Luk, Chao Yang, Wei Xue, Xiaomeng Huang, Youhui Zhang, Guangwen Yang","doi":"10.1109/FPL.2013.6645508","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645508","url":null,"abstract":"One of the most essential and challenging components in a climate system model is the atmospheric model. To solve the multi-physical atmospheric equations, developers have to face extremely complex stencil kernels. In this paper, we propose a hybrid CPU-FPGA algorithm that applies single and multiple FPGAs to compute the upwind stencil for the global shallow water equations. Through mixed-precision arithmetic, we manage to build a fully pipelined upwind stencil design on a single FPGA, which can perform 428 floating-point and 235 fixed-point operations per cycle. The CPU-FPGA algorithm using one Virtex-6 FPGA provides 100 times speedup over a 6-core CPU and 4 times speedup over a hybrid node with 12 CPU cores and a Fermi GPU card. The algorithm using four FPGAs provides 330 times speedup over a 6-core CPU; it is also 14 times faster and 9 times more power efficient than the hybrid CPU-GPU node.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123628974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An open-source multi-FPGA modular system for fair benchmarking of True Random Number Generators","authors":"V. Fischer, F. Bernard, Patrick Haddad","doi":"10.1109/FPL.2013.6645570","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645570","url":null,"abstract":"True Random Number Generators (TRNG) are cryptographic primitives that exploit intrinsic noise sources in electronic devices. Their quality is linked to the underlying technology, activity of the neighboring circuitry and device environment (temperature, power supply, electromagnetic emanations). Consequently, when comparing TRNGs, they should be tested in identical technology, system architecture and operating conditions. We present a unified hardware platform and related open source tools aimed at fair benchmarking of TRNGs implemented in different FPGA technologies. The platform is accessible remotely. Designers can download related tools from the web site and they can upload their configuration bitstream to the remote FPGA and download random data generated in the same hardware and in the same conditions as other concurrent designs and state-of-the-art generators. The proposed tools were approved in many applications and they guarantee safe acquisition of random sequences at data rates of up to 400 Mbits/s.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123642044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving autonomous soft-error tolerance of FPGA through LUT configuration bit manipulation","authors":"Anup Das, Shyamsundar Venkataraman, Akash Kumar","doi":"10.1109/FPL.2013.6645498","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645498","url":null,"abstract":"Soft-errors in LUT configuration bits of FPGAs can alter the functionality of an implemented design, rendering it useless, unless re-programmed. This paper proposes a technique to improve autonomous fault-masking capabilities of a design by maximizing the number of zeros or ones in LUTs. The technique utilizes spare resources (XOR gates and carry chain) of FPGA devices to selectively manipulate LUT contents using two operations - LUT restructuring and LUT decomposition. Experiments conducted with a wide set of benchmarks from MCNC, IWLS 2005 and ITC99 benchmark suite on Xilinx Virtex 6 FPGA board demonstrate that the proposed methodology maximizes logic 0/1 of LUTs by an average 20% achieving 80% fault-masking with no area overhead. The fault-rate of the entire design is reduced by 60% on average as compared to the existing techniques. Further, an additional 5% fault-masking can be achieved with a 7% increase in slice usage.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114157014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient implementation of Virtual Coarse Grained Reconfigurable Arrays on FPGAS","authors":"Karel Heyse, Tom Davidson, Elias Vansteenkiste, Karel Bruneel, D. Stroobandt","doi":"10.1109/FPL.2013.6645516","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645516","url":null,"abstract":"Fine grained Field Programmable Gate Arrays (FPGA) are complex to program and therefore suffer from high development costs. To solve this problem, Virtual Coarse Grained Reconfigurable Arrays (Virtual CGRA), or CGRAs implemented on FPGAs, have been proposed. Conventional implementations of VCGRAs use functional FPGA resources, such as LookUp Tables, to implement the virtual switch blocks, registers and other components that make the VCGRA configurable. We show that this is a large overhead that can often be avoided by mapping these components directly on lower level FPGA resources such as physical switch blocks and configuration memory. We show how this can be achieved using the tool flow for parameterised FPGA configurations and illustrate the advantages of this method by showing that an area reduction of 50% is attainable for a VCGRA aimed at regular expression matching.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131302496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accurate and flexible flow-based monitoring for high-speed networks","authors":"Marco Forconesi, G. Sutter, S. López-Buedo, J. Aracil","doi":"10.1109/FPL.2013.6645557","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645557","url":null,"abstract":"In this paper we present an FPGA-based architecture to export flows in 10 Gbps networks, implemented on the NetFPGA-10G platform. Flow-based monitoring is a powerful methodology to analyze and detect network issues, such as congested links or DDoS attacks. Our design provides the following advantages: (i) The architecture allows processing 10 Gbps links without sampling, even for the highest packet rate of 14.88 Mpps (Million packets per second) that corresponds to the shortest (64-byte) Ethernet frames; (ii) It is possible to manage up to 786,432 concurrent flows; (iii) The project is developed in an open-source hardware platform and the HDL code is open to the community; (iv) The proposed approach frees network routers from the burden of exporting flows.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114748091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy efficient parameterized FFT architecture","authors":"Ren Chen, H. Le, V. Prasanna","doi":"10.1109/FPL.2013.6645545","DOIUrl":"https://doi.org/10.1109/FPL.2013.6645545","url":null,"abstract":"In this paper, we revisit the classic Fast Fourier Transform (FFT) for energy efficient designs on FPGAs. A parameterized FFT architecture is proposed to identify the design trade-offs in achieving energy efficiency. We first perform design space exploration by varying the algorithm mapping parameters, such as the degree of vertical and horizontal parallelism, that characterize decomposition based FFT algorithms. Then we explore an energy efficient design by empirical selection on the values of the chosen architecture parameters, including the type of memory elements, the type of interconnection network and the number of pipeline stages. The trade offs between energy, area, and time are analyzed using two performance metrics: the energy efficiency (defined as the number of operations per Joule) and the Energy×Area×Time (EAT) composite metric. From the experimental results, a design space is generated to demonstrate the effect of these parameters on the various performance metrics. For N-point FFT (16 ≤ N ≤ 1024), our designs achieve up to 28% and 38% improvement in the energy efficiency and EAT, respectively, compared with a state-of-the-art design.","PeriodicalId":200435,"journal":{"name":"2013 23rd International Conference on Field programmable Logic and Applications","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127850568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}