ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors: Latest Publications

Code generation for hardware accelerated AES

ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2010-07-07 DOI: 10.1109/ASAP.2010.5540955

Raymond Manley, Paul Magrath, David Gregg

Abstract: Data must be encrypted if it is to remain confidential when sent over computer networks. Encryption solves many problems involving invasion of privacy, identity theft, fraud, and data theft. However for encryption to be widely used, it must be fast. The problem is so important that new Intel processors provide hardware support for encryption. These instructions implement key stages of the Advanced Encryption Standard (AES), allowing encryption to be completed more quickly and using less power. The AES algorithm consists of several 'rounds' of encryption, each of which involves a relatively complicated computation. This new hardware support allows an entire round to be implemented with just a single instruction. An implementation of the AES algorithm using these instructions contains several code sections that can be fine tuned for optimal performance. However, these optimizations are usually done by hand, which can be a lengthy, labour intensive process. We present a system that can generate billions of variants of the AES encryption code to find the best solution for a particular microarchitecture. We apply both common loop optimizations and ones specific to AES. We evaluate the generated code on hardware with built-in AES support using both selective-brute force and guided searches. Our generator achieves significant speedups over a straightforward implementation of the code.

Citations: 2
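
The abstract above describes generating many variants of AES code built from single-instruction rounds. The following is a minimal sketch of that idea, not the authors' generator: it emits C source for AES-128 encryption using the Intel AES-NI intrinsics (`_mm_aesenc_si128`, `_mm_aesenclast_si128`) and varies only one dimension, the number of blocks processed per loop iteration. The function name `generate_aes128_encrypt`, the `unroll` parameter, and the layout of the emitted C function are assumptions made for illustration.

```python
def generate_aes128_encrypt(unroll: int = 1) -> str:
    """Emit C source that encrypts `unroll` independent blocks per loop iteration.

    Hypothetical generator for illustration only. AES-128 uses an initial
    AddRoundKey (XOR with round key 0), nine full rounds, and one final round.
    """
    if unroll < 1:
        raise ValueError("unroll must be at least 1")

    lines = [
        "#include <stddef.h>",
        "#include <immintrin.h>  /* AES-NI intrinsics */",
        "",
        "/* rk holds the 11 expanded round keys; nblocks must be a multiple",
        f"   of {unroll}. Both conventions are assumptions of this sketch. */",
        f"void aes128_encrypt_u{unroll}(const __m128i *in, __m128i *out,",
        "                        const __m128i rk[11], size_t nblocks)",
        "{",
        f"    for (size_t i = 0; i < nblocks; i += {unroll}) {{",
    ]

    # Initial AddRoundKey for every block in the group.
    for b in range(unroll):
        lines.append(
            f"        __m128i s{b} = "
            f"_mm_xor_si128(_mm_loadu_si128(&in[i + {b}]), rk[0]);"
        )

    # Nine full rounds, emitted round-major so that independent blocks are
    # interleaved and can overlap in the processor pipeline.
    for r in range(1, 10):
        for b in range(unroll):
            lines.append(f"        s{b} = _mm_aesenc_si128(s{b}, rk[{r}]);")

    # Final round (no MixColumns), then store the ciphertext blocks.
    for b in range(unroll):
        lines.append(f"        s{b} = _mm_aesenclast_si128(s{b}, rk[10]);")
    for b in range(unroll):
        lines.append(f"        _mm_storeu_si128(&out[i + {b}], s{b});")

    lines += ["    }", "}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(generate_aes128_encrypt(unroll=2))
```

A search over variants, as the abstract describes, would compile and time the output for several values of `unroll` (and other transformations) on the target machine; that measurement step is omitted here.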

Implementing decimal floating-point arithmetic through binary: Some suggestions

ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2010-07-07 DOI: 10.1109/ASAP.2010.5540969

N. Brisebarre, N. Louvet, Érik Martin-Dorel, J. Muller, A. Panhaleux, M. Ercegovac

Citations: 2

Design of throughput-optimized arrays from recurrence abstractions

ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2010-07-07 DOI: 10.1109/ASAP.2010.5540753

A. Jacob, J. Buhler, R. Chamberlain

Abstract: Many compute-bound applications have seen order-of-magnitude speedups using special-purpose accelerators. FPGAs in particular are good at implementing recurrence equations realized as arrays. Existing high-level synthesis approaches for recurrence equations produce an array that is latency-space optimal. We target applications that operate on a large collection of small inputs, e.g. a database of biological sequences, where overall throughput is the most important measure of performance. In this work, we introduce a new design-space exploration procedure within the polyhedral framework to optimize throughput of a systolic array subject to area and bandwidth constraints of an FPGA device. Our approach is to exploit additional parallelism by pipelining multiple inputs on an array and multiple iteration vectors in a processing element. We prove that the throughput of an array is given by the inverse of the maximum number of iteration vectors executed by any processor in the array, which is determined solely by the array's projection vector. We have applied this observation to discover novel arrays for Nussinov RNA folding. Our throughput-optimized array is 2× faster than the standard latency-space optimal array, yet it uses 15% fewer LUT resources. We achieve a further 2× speedup by processor pipelining, with only a 37% increase in resources. Our tool suggests additional arrays that trade area for throughput and are 4–5× faster than the currently used latency-optimized array. These novel arrays are 70–172× faster than a software baseline.

Citations: 5
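
The throughput result stated in the abstract above (throughput equals the inverse of the maximum number of iteration vectors executed by any processor, as fixed by the projection vector) can be illustrated with a small calculation. This is a hedged sketch, not the authors' tool: it assumes a two-dimensional triangular iteration domain 1 <= i <= j <= n as a stand-in for a Nussinov-style recurrence, assumes each processor executes one iteration vector per cycle, and ignores whether a given projection admits a valid schedule. The names `triangular_domain`, `max_load`, and `throughput` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import gcd


def triangular_domain(n):
    """Integer points (i, j) with 1 <= i <= j <= n (assumed example domain)."""
    return [(i, j) for i in range(1, n + 1) for j in range(i, n + 1)]


def max_load(domain, u):
    """Maximum number of iteration vectors mapped to a single processor.

    Two points land on the same processor when they differ by a multiple of
    the projection vector u = (a, b). For a primitive u (gcd(a, b) == 1),
    the value b*i - a*j is constant exactly along such lines, so it serves
    as a processor identifier.
    """
    a, b = u
    if gcd(a, b) != 1:
        raise ValueError("projection vector must be primitive and non-zero")
    load = Counter(b * i - a * j for i, j in domain)
    return max(load.values())


def throughput(domain, u):
    """Inputs completed per cycle: the inverse of the busiest processor's load."""
    return Fraction(1, max_load(domain, u))


if __name__ == "__main__":
    dom = triangular_domain(4)
    for u in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        print(u, throughput(dom, u))
```

In this toy domain, projecting along (1, -1) spreads the work over more processors than projecting along (0, 1), which lowers the busiest processor's load and raises throughput at the cost of a larger array. That is the kind of area-for-throughput trade the abstract mentions.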

A fully-overlapped multi-mode QC-LDPC decoder architecture for mobile WiMAX applications

ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2010-07-07 DOI: 10.1109/ASAP.2010.5540958

Bo Xiang, Dan Bao, Shuangqu Huang, Xiaoyang Zeng

Citations: 11

A New approach in on-line task scheduling for reconfigurable computing systems

ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2010-07-07 DOI: 10.1109/ASAP.2010.5540975

M. M. Bassiri, H. Shahhoseini

Citations: 17

A GALS FFT processor with clock modulation for low-EMI applications

ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2010-07-07 DOI: 10.1109/ASAP.2010.5541014

Xin Fan, M. Krstic, C. Wolf, E. Grass

Citations: 9

A forwarding-sensitive instruction scheduling approach to reduce register file constraints in VLIW architectures

ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2010-07-07 DOI: 10.1109/ASAP.2010.5541015

G. P. Vayá, J. Martín-Langerwerf, H. Blume, P. Pirsch

Abstract: This paper presents a forwarding-based approach to increase the code compaction and consequently the processing performance of VLIW media-processors that implement monolithic or partitioned register file (RF) organizations with reduced number of read/write ports. This approach exploits the forwarding mechanism implemented in common pipelined VLIW architectures to reduce the number of RF accesses, which is one of the main limiting factors of the code compaction process. This RF access reduction enables a higher instruction scheduling efficiency and eventually decreases the power consumption, without requiring extra hardware. A forwarding-sensitive code generation algorithm based on an enhanced list scheduling algorithm is described in detail. In addition, three case studies are presented, where the proposed scheduling algorithm leads to performance improvements of up to 8.4% when running common image and video codec tasks on a generic VLIW architecture. This is attractively close to the maximum performance improvement (11.4%) that can be achieved when investing in hardware by using a RF with twice the number of ports.

Citations: 3

Design of an Automatic Target Recognition algorithm on the IBM Cell Broadband Engine

ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2010-07-07 DOI: 10.1109/ASAP.2010.5540770

W. Che, Karam S. Chatha

Citations: 1

Memoryless RNS-to-binary converters for the {2^{n+1} - 1, 2^n, 2^n - 1} moduli set

ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2010-07-07 DOI: 10.1109/ASAP.2010.5540979

K. Gbolagade, G. Voicu, S. Cotofana

Citations: 6
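
For the three-moduli set named in the title above, conversion from a residue number system (RNS) back to binary can be illustrated with the textbook Chinese Remainder Theorem. This is only a software reference sketch under that assumption: the listing gives no abstract for this paper, and the memoryless hardware converters it proposes are not reproduced here. The function names `moduli_set`, `to_rns`, and `from_rns` are hypothetical, and the three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
def moduli_set(n):
    """Return the moduli (2^(n+1) - 1, 2^n, 2^n - 1), pairwise coprime for n >= 2."""
    if n < 2:
        raise ValueError("this sketch assumes n >= 2")
    return (2 ** (n + 1) - 1, 2 ** n, 2 ** n - 1)


def to_rns(x, moduli):
    """Forward conversion: one residue per modulus."""
    return tuple(x % m for m in moduli)


def from_rns(residues, moduli):
    """Reverse conversion via the Chinese Remainder Theorem (textbook form)."""
    dynamic_range = 1
    for m in moduli:
        dynamic_range *= m

    x = 0
    for r, m in zip(residues, moduli):
        partial = dynamic_range // m       # product of the other moduli
        inverse = pow(partial, -1, m)      # multiplicative inverse modulo m
        x += r * partial * inverse
    return x % dynamic_range


if __name__ == "__main__":
    mods = moduli_set(4)                   # (31, 16, 15)
    residues = to_rns(1234, mods)
    print(mods, residues, from_rns(residues, mods))
```

The round trip is exact for any integer in the range 0 to the product of the moduli minus one, which for n = 4 is 0 to 7439.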

Dynamic code mapping for limited local memory systems

ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2010-07-07 DOI: 10.1109/ASAP.2010.5540773

S. Jung, Aviral Shrivastava, Ke Bai

Citations: 38