{"title":"Code generation for hardware accelerated AES","authors":"Raymond Manley, Paul Magrath, David Gregg","doi":"10.1109/ASAP.2010.5540955","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540955","url":null,"abstract":"Data must be encrypted if it is to remain confidential when sent over computer networks. Encryption solves many problems involving invasion of privacy, identity theft, fraud, and data theft. However, for encryption to be widely used, it must be fast. The problem is so important that new Intel processors provide hardware support for encryption. These instructions implement key stages of the Advanced Encryption Standard (AES), allowing encryption to be completed more quickly and using less power. The AES algorithm consists of several 'rounds' of encryption, each of which involves a relatively complicated computation. This new hardware support allows an entire round to be implemented with just a single instruction. An implementation of the AES algorithm using these instructions contains several code sections that can be fine-tuned for optimal performance. However, these optimizations are usually done by hand, which can be a lengthy, labour-intensive process. We present a system that can generate billions of variants of the AES encryption code to find the best solution for a particular microarchitecture. We apply both common loop optimizations and ones specific to AES. We evaluate the generated code on hardware with built-in AES support using both selective brute-force and guided searches. Our generator achieves significant speedups over a straightforward implementation of the code.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130022198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing decimal floating-point arithmetic through binary: Some suggestions","authors":"N. Brisebarre, N. Louvet, Érik Martin-Dorel, J. Muller, A. Panhaleux, M. Ercegovac","doi":"10.1109/ASAP.2010.5540969","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540969","url":null,"abstract":"We propose algorithms and provide some related results that make it possible to implement decimal floating-point arithmetic on a processor that does not have decimal operators, using the available binary floating-point functions. In this preliminary study, we focus on round-to-nearest mode only. We show that several functions in decimal32 and decimal64 arithmetic can be implemented using binary64 and binary128 floating-point arithmetic, respectively. We discuss the decimal square root and some transcendental functions. We also consider radix conversion algorithms.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133671565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design of throughput-optimized arrays from recurrence abstractions","authors":"A. Jacob, J. Buhler, R. Chamberlain","doi":"10.1109/ASAP.2010.5540753","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540753","url":null,"abstract":"Many compute-bound applications have seen order-of-magnitude speedups using special-purpose accelerators. FPGAs in particular are good at implementing recurrence equations realized as arrays. Existing high-level synthesis approaches for recurrence equations produce an array that is latency-space optimal. We target applications that operate on a large collection of small inputs, e.g. a database of biological sequences, where overall throughput is the most important measure of performance. In this work, we introduce a new design-space exploration procedure within the polyhedral framework to optimize throughput of a systolic array subject to area and bandwidth constraints of an FPGA device. Our approach is to exploit additional parallelism by pipelining multiple inputs on an array and multiple iteration vectors in a processing element. We prove that the throughput of an array is given by the inverse of the maximum number of iteration vectors executed by any processor in the array, which is determined solely by the array's projection vector. We have applied this observation to discover novel arrays for Nussinov RNA folding. Our throughput-optimized array is 2× faster than the standard latency-space optimal array, yet it uses 15% fewer LUT resources. We achieve a further 2× speedup by processor pipelining, with only a 37% increase in resources. Our tool suggests additional arrays that trade area for throughput and are 4–5× faster than the currently used latency-optimized array. These novel arrays are 70–172× faster than a software baseline.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128503649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fully-overlapped multi-mode QC-LDPC decoder architecture for mobile WiMAX applications","authors":"Bo Xiang, Dan Bao, Shuangqu Huang, Xiaoyang Zeng","doi":"10.1109/ASAP.2010.5540958","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540958","url":null,"abstract":"A fully-overlapped multi-mode QC-LDPC decoder architecture, adopting an improved TDMP algorithm, is presented in this paper. With symmetrical four-stage pipelining, block column and row permutations, nonzero sub-matrix reordering, sum memory odd-even partition, and read-write bypass, two phases are fully overlapped and each phase scans nonzero sub-matrices one by one in block row-wise order without access conflicts to sum memories. The sum memories store not only variable node sums but also prior messages, which saves an additional FIFO of 13 440 bits. The decoder attains 248-287 Mb/s at 150 MHz and 15 iterations.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131184070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New approach in on-line task scheduling for reconfigurable computing systems","authors":"M. M. Bassiri, H. Shahhoseini","doi":"10.1109/ASAP.2010.5540975","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540975","url":null,"abstract":"Reconfiguration overhead is an important obstacle that limits the performance of on-line scheduling algorithms in reconfigurable computing systems and increases the overall execution time. Configuration reusing (task reusing) can decrease reconfiguration overhead considerably, particularly in periodic applications. In this paper, we present a new approach for on-line scheduling and placement in which configuration reusing is considered as a main characteristic in order to reduce reconfiguration overhead and decrease total execution time of the tasks. A large variety of experiments have been conducted on the proposed algorithm. Obtained results show considerable improvement in overall execution time of the tasks.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132647240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A GALS FFT processor with clock modulation for low-EMI applications","authors":"Xin Fan, M. Krstic, C. Wolf, E. Grass","doi":"10.1109/ASAP.2010.5541014","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5541014","url":null,"abstract":"With the growth in complexity of digital CMOS circuits, the steep current fluctuations introduced by numerous transistors switching with clock signals are proven to be a significant source of electromagnetic interference (EMI). In recent years the reduction in EMI noise from high speed digital ICs has already gained intensive research attention. In this paper the pausible clocking based globally asynchronous locally synchronous (GALS) design with phase and frequency modulation on the locally generated clocks is proposed as a systematic solution to EMI reduction. As a practical example, a 64-point Radix-2^3 pipelined GALS FFT processor was implemented using the IHP 130nm CMOS technology for low-EMI applications. The on-chip measurements demonstrate 13dB attenuation at the clock fundamental frequency and more than 20dB attenuation at higher clock harmonics, in comparison with the synchronous design.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133745808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A forwarding-sensitive instruction scheduling approach to reduce register file constraints in VLIW architectures","authors":"G. P. Vayá, J. Martín-Langerwerf, H. Blume, P. Pirsch","doi":"10.1109/ASAP.2010.5541015","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5541015","url":null,"abstract":"This paper presents a forwarding-based approach to increase the code compaction and consequently the processing performance of VLIW media-processors that implement monolithic or partitioned register file (RF) organizations with a reduced number of read/write ports. This approach exploits the forwarding mechanism implemented in common pipelined VLIW architectures to reduce the number of RF accesses, which is one of the main limiting factors of the code compaction process. This RF access reduction enables a higher instruction scheduling efficiency and eventually decreases the power consumption, without requiring extra hardware. A forwarding-sensitive code generation algorithm based on an enhanced list scheduling algorithm is described in detail. In addition, three case studies are presented, where the proposed scheduling algorithm leads to performance improvements of up to 8.4% when running common image and video codec tasks on a generic VLIW architecture. This is attractively close to the maximum performance improvement (11.4%) that can be achieved when investing in hardware by using an RF with twice the number of ports.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131996449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memoryless RNS-to-binary converters for the {2^(n+1) - 1, 2^n, 2^n - 1} moduli set","authors":"K. Gbolagade, G. Voicu, S. Cotofana","doi":"10.1109/ASAP.2010.5540979","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540979","url":null,"abstract":"In this paper, we propose two novel memoryless reverse converters for the moduli set {2^(n+1) - 1, 2^n, 2^n - 1}. The first proposed converter does not entirely cover the dynamic range while the second proposed converter covers the entire dynamic range. First, we simplify the Chinese Remainder Theorem in order to obtain a reverse converter that utilizes mod-(2^(n+1) - 1) operation. Second, we further reduce the resulting architecture to obtain a reverse converter that uses only carry save adders and carry propagate adders. FPGA implementation results indicate that, on average, the proposed limited dynamic range converter achieves about 42% area reduction. However, the second proposed converter provides only 29.48% area reduction when compared with the most effective equivalent state of the art converter. Both of the proposed converters also exhibit a small speed improvement over the state of the art equivalent converter.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117311407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic code mapping for limited local memory systems","authors":"S. Jung, Aviral Shrivastava, Ke Bai","doi":"10.1109/ASAP.2010.5540773","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540773","url":null,"abstract":"This paper presents heuristics for dynamic management of application code on limited local memories present in high-performance multi-core processors. Previous techniques formulate the problem using call graphs, which do not capture the temporal ordering of functions. In addition, they only use a conservative estimate of the interference cost between functions to obtain a mapping. As a result, previous techniques are unable to achieve efficient code mapping. Techniques proposed in this paper overcome both these limitations and achieve superior code mapping. Experimental results from executing benchmarks from MiBench onto the Cell processor in the Sony PlayStation 3 demonstrate up to 29% and average 12% performance improvement, at tolerable compile-time overhead.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123334452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A formal specification of fault-tolerance in prospecting asteroid mission with Reactive Autonomic Systems Framework","authors":"Heng Kuang, O. Ormandjieva, S. Klasa, J. Bentahar","doi":"10.1109/ASAP.2010.5540769","DOIUrl":"https://doi.org/10.1109/ASAP.2010.5540769","url":null,"abstract":"NASA's Autonomous Nano Technology Swarm (ANTS) is a generic mission architecture consisting of miniaturized, autonomous, self-similar, reconfigurable, and addressable components forming structures. The Prospecting Asteroid Mission (PAM) is one of the ANTS applications for survey of large dynamic populations. In this paper, we propose a formal approach based on Category Theory to specify the fault-tolerance property in PAM using the Reactive Autonomic Systems Framework.","PeriodicalId":175846,"journal":{"name":"ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123612733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}