M. Kamal, N. K. Amiri, Arezoo Kamran, S. A. Hoseini, M. Dehyadegari, Hamid Noori
{"title":"Dual-purpose custom instruction identification algorithm based on Particle Swarm Optimization","authors":"M. Kamal, N. K. Amiri, Arezoo Kamran, S. A. Hoseini, M. Dehyadegari, Hamid Noori","doi":"10.1109/ASAP.2010.5541012","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5541012","url":null,"abstract":"Extending instruction set architecture (ISA) of embedded processors is an effective way to enhance performance and energy efficiency. The typical approaches for identifying custom instructions (CIs) limit the maximum number of input and output (I/O) operands to the available register file port. Recently, there are several work that explore CI candidates without imposing a limit on the number of input and output operands. In this paper, we present a new algorithm based on Particle Swarm Optimization (PSO) to identify CIs within a given data flow graph (DFG) and evaluate it for both categories of CI identification approaches (with and without I/O constrains). By novel evolving strategy, we enhance the quality of the results in our partitioning algorithm. Experimental results show that in most cases CI identification with I/O constraints based on PSO finds better or the same CIs in terms of performance compared to genetic algorithm (GA)[1] and ISEGEN [2] (96% and 90%, respectively). Comparing our proposed algorithm with [12] and [13] reveals that ours has a shorter run-time several order of magnitudes for large DFGs and is independent of the number of forbidden nodes. Moreover, we propose a modified version of PSO called Wrapper PSO that is up to 100× and 500× faster than GA and ISEGEN in large DFGs, respectively.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123261740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic generation of polynomial-based hardware architectures for function evaluation","authors":"F. D. Dinechin, Mioara Joldes, B. Pasca","doi":"10.1109/ASAP.2010.5540952","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540952","url":null,"abstract":"Polynomial approximation is a very general technique for the evaluation of a wide class of numerical functions of one variable. This article details an architecture generator that inputs the specification of a function and outputs a synthe-sizable description of an architecture evaluating this function with guaranteed accuracy. It improves upon the literature in two aspects. Firstly, it uses better polynomials, thanks to recent advances related to constrained-coefficient polynomial approximation. Secondly, it refines the error analysis of polynomial evaluation to reduce the size of the multipliers used. An open-source implementation is provided in the FloPoCo project, including architecture exploration heuristics designed to use efficiently the embedded memories and multipliers of high-end FPGAs. High-performance pipelined architectures for precisions up to 64 bits can be obtained in seconds.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114721045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"General-purpose FPGA platform for efficient encryption and hashing","authors":"Jakub Szefer, Yu-Yuan Chen, R. Lee","doi":"10.1109/ASAP.2010.5540976","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540976","url":null,"abstract":"Many applications require protection of secret or sensitive information, from sensor nodes and embedded applications to large distributed systems. The confidentiality of data can be protected by encryption using symmetric-key ciphers, and the integrity of the data can be ensured by using a cryptographic hash function to calculate a \"digital fingerprint.\" In this paper, we propose reconfigurable FPGA hardware components that enable rapid deployment of cryptographic and other algorithms. The novelty of our hardware components is in their general-purpose design which enables easy mappings of algorithms to allow customizations of data protection for different usage scenarios. Since we utilize only a small part of an FPGA chip, our design can be readily integrated with other processing needs of a mobile device, a sensor node or a System-on-Chip. Important block ciphers like the Advanced Encryption Standard (AES) as well as advanced cryptographic hash algorithms like Whirlpool map well onto our general-purpose components. Our solution facilitates easy hardware implementation of customizable encryption and hashing solutions, with area and speed performance comparable to custom FPGA implementations targeted at a specific cipher or hash algorithm. We achieve the best efficiency in Mbps/slice for Whirlpool. Furthermore, the components that we have proposed can be used for many other applications - not just for implementing block ciphers and cryptographic hash functions.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134010003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Hoang, Ulf Jalmbrant, Erik der Hagopian, K. Subramaniyan, Magnus Själander, P. Larsson-Edefors
{"title":"Design space exploration for an embedded processor with flexible datapath interconnect","authors":"T. Hoang, Ulf Jalmbrant, Erik der Hagopian, K. Subramaniyan, Magnus Själander, P. Larsson-Edefors","doi":"10.1109/ASAP.2010.5540812","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540812","url":null,"abstract":"The design of an embedded processor is dependent on the application domain. Traditionally, design solutions specific to an application domain have been available in three forms: VLIW-based DSP processors, ASICs and FPGAs; each respectively offering generality of application domain, energy efficiency and flexibility. However, while matching the application domain to the resources needed, the design space becomes huge. We present FlexTools, a tool framework built around the FlexCore architecture to evaluate performance and energy efficiency for different applications. Here we demonstrate FlexTools for design space exploration with a focus on the data-routing flexibility of the FlexCore processor, in search of energy-efficient interconnect configurations that are both cycle-count and hardware efficient. Evaluation results suggest that a well-optimized instance of a 65-nm multiplier-extended FlexCore processor datapath, obtained using FlexTools, executes nine integer EEMBC benchmarks with a 15% cycle count reduction and dissipates 17% less energy than a reference MIPS datapath.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114184252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sujay Deb, A. Ganguly, Kevin Chang, P. Pande, B. Belzer, D. Heo
{"title":"Enhancing performance of network-on-chip architectures with millimeter-wave wireless interconnects","authors":"Sujay Deb, A. Ganguly, Kevin Chang, P. Pande, B. Belzer, D. Heo","doi":"10.1109/ASAP.2010.5540799","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540799","url":null,"abstract":"In a traditional Network-on-Chip (NoC), latency and power dissipation increase with system size due to its inherent multi-hop communications. The performance of NoC communication fabrics can be significantly enhanced by introducing long-range, low power, high bandwidth direct links between far apart cores. In this paper a design methodology for a scalable hierarchical NoC with on-chip millimeter (mm)-wave wireless links is proposed. The proposed wireless NoC offers significantly higher throughput and lower energy dissipation compared to its conventional multi-hop wired counterpart. It is also demonstrated that the proposed hierarchical NoC with long range wireless links shows significant performance gains in presence of various application-specific traffic and multicast scenarios.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130701606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient computation model for coarse grained reconfigurable architectures and its applications to a reconfigurable computer","authors":"Oguzhan Atak, A. Atalar","doi":"10.1109/ASAP.2010.5541009","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5541009","url":null,"abstract":"The mapping of high level applications onto the coarse grained reconfigurable architectures (CGRA) are usually performed manually by using graphical tools or when automatic compilation is used, some restrictions are imposed to the high level code. Since high level applications do not contain parallelism explicitly, mapping the application directly to CGRA is very difficult. In this paper, we present a middle level Language for Reconfigurable Computing (LRC). LRC is similar to assembly languages of microprocessors, with the difference that parallelism can be coded in LRC. LRC is an efficient language for describing control data flow graphs. Several applications such as FIR, multirate, multichannel filtering, FFT, 2D-IDCT, Viterbi decoding, UMTS and CCSDC turbo decoding, Wimax LDPC decoding are coded in LRC and mapped to the Bilkent Reconfigurable Computer with a performance (in terms of cycle count) close to that of ASIC implementations. The applicability of the computation model to a CGRA having low cost interconnection network has been validated by using placement and routing algorithms.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"404 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116076182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible hardware/software co-design for scalable elliptic curve cryptography for low-resource applications","authors":"Mohamed N. Hassan, M. Benaissa, A. Kanakis","doi":"10.1109/ASAP.2010.5540993","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540993","url":null,"abstract":"In this paper, we investigate the potential of the hardware/software co-design to realize a flexible-low resources elliptic curve cryptography (ECC) processor over binary finite fields GF(2m) on FPGA platforms. A design is proposed that is capable to work over different curves recommended by the ECC standards, namely, m = 163, 283, 571 without reconfiguring either the software or the hardware. The proposed hardware-software co-design is hosted on a free-so ft-core processor from Xilinx FPGA, namely the PicoBlaze. Two novel arithmetic circuits that represent the hardware environment are introduced to perform multi-precision arithmetic and scalable reduction over GF(2m). Furthermore, the proposed architecture is parameterized for different data widths (8, 16, 32 bits) to evaluate the optimal resource utilization versus performance trade-off to be made for the low resource-end application while still maintaining flexibility (scalability) across the chosen curves. The implementation of the flexible ECC processor consumes only 392 (51%) and 534 (62%) slices of the lowest cost chips from Xilinx Spartan III namely XC3S50 for 8 and 16-bits data paths, and 1278 (66%) slices for 32-bit data path on Spartan III XC3S200.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133562464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Loop transformations for interface-based hierarchies IN SDF graphs","authors":"J. Piat, S. Bhattacharyya, M. Raulet","doi":"10.1109/ASAP.2010.5540954","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540954","url":null,"abstract":"Data-flow has proven to be an attractive computation model for programming digital signal processing (DSP) applications. A restricted version of data-flow, termed synchronous data-flow (SDF), offers strong compile-time predictability properties, but has limited expressive power. A new type of hierarchy (Interface-based SDF) has been proposed allowing more expressivity while maintaining its predictability. One of the main problems with this hierarchical SDF model is the lack of trade-off between parallelism and network clustering. This paper presents a systematic method for applying an important class of loop transformation techniques in the context of interface-based SDF semantics. The resulting approach provides novel capabilities for integrating parallelism extraction properties of the targeted loop transformations with the useful modeling, analysis, and code reuse properties provided by SDF.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"370 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122853722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring algorithmic trading in reconfigurable hardware","authors":"S. Wray, W. Luk, P. Pietzuch","doi":"10.1109/ASAP.2010.5540966","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540966","url":null,"abstract":"This paper describes an algorithmic trading engine based on reconfigurable hardware, derived from a software implementation. Our approach exploits parallelism and reconfigurability of field-programmable gate array (FPGA) technology. FPGAs offer many benefits over software solutions, including a reduction in latency, while increasing overall throughput and computational density. All of which are important attributes to a successful algorithmic trading engine. Experiments show that the peak performance of our hardware architecture for algorithmic trading is 133 times faster than the corresponding software implementation. Six implementations can operate simultaneously on a Xilinx Vertex 5 xc5vlx30 FPGA on average, maximising performance and available resource usage.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126356457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jo Vliegen, N. Mentens, Jan Genoe, An Braeken, S. Kubera, A. Touhafi, I. Verbauwhede
{"title":"A compact FPGA-based architecture for elliptic curve cryptography over prime fields","authors":"Jo Vliegen, N. Mentens, Jan Genoe, An Braeken, S. Kubera, A. Touhafi, I. Verbauwhede","doi":"10.1109/ASAP.2010.5540977","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540977","url":null,"abstract":"This paper proposes an FPGA-based application-specific elliptic curve processor over a prime field. This research targets applications for which compactness is more important than speed. To obtain a small datapath, the FPGA's dedicated multipliers and carry-chain logic are used and no parallellism is introduced. A small control unit is obtained by following a microcode approach, in which the instructions are stored in the FPGA's Block RAM. The use of algorithms that prevent Simple Power Analysis (SPA) attacks creates an extra cost in latency. Nevertheless, the created processor is flexible in the sense that it can handle all finite field operations over 256-bit prime fields and all elliptic curves of a specified form. The comparison with other implementations on the same generation of FPGAs learns that our design occupies the smallest area.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130170626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}