{"title":"Concurrent systolic architecture for high-throughput implementation of 3-dimensional discrete wavelet transform","authors":"B. K. Mohanty, P. Meher","doi":"10.1109/ASAP.2008.4580172","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580172","url":null,"abstract":"In this paper, we present a novel systolic architecture for high-throughput computation of 3-dimensional (3-D) discrete wavelet transform (DWT). The entire 3-D DWT computation is decomposed into three distinct stages and implemented concurrently in a linear array of fully pipelined processing elements (PE). The proposed structure for 3-D DWT provides higher throughput than the existing architecture; and involves nearly half or less the number of multipliers and adders; and less on-chip memory (when normalized for unit throughput rate) than the other. Most importantly, the proposed one does not require any frame buffer unlike the other to perform inter-frame DWT computation. The proposed structure has a small latency and can perform 3-D DWT computation with 100% hardware utilization efficiency.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"123 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133243548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-cost implementations of NTRU for pervasive security","authors":"A. C. Atici, L. Batina, Junfeng Fan, I. Verbauwhede, S. Yalcin","doi":"10.1109/ASAP.2008.4580158","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580158","url":null,"abstract":"NTRU is a public-key cryptosystem based on the shortest vector problem in a lattice which is an alternative to RSA and ECC. This work presents a compact and low power NTRU design that is suitable for pervasive security applications such as RFIDs and sensor nodes. We have designed two architectures, one is only capable of encryption and the other one performs both encryption and decryption. The strategy for the designs includes clock gating of registers, operand isolation and precomputation. This work is also the first one to present a complete NTRU design with encryption/decryption circuitry. Our encryption-only NTRU design has a gate-count of 2.8 kgates and dynamic power consumption of 1.72 μW. Moreover, encryption-decryption NTRU design consumes about 6 μW dynamic power and consists of 10.5 kgates.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121863353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Configurable and scalable high throughput turbo decoder architecture for multiple 4G wireless standards","authors":"Yang Sun, Yuming Zhu, M. Goel, Joseph R. Cavallaro","doi":"10.1109/ASAP.2008.4580180","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580180","url":null,"abstract":"In this paper, we propose a novel multi-code turbo decoder architecture for 4G wireless systems. To support various 4G standards, a configurable multi-mode MAP (maximum a posteriori) decoder is designed for both binary and duo-binary turbo codes with small resource overhead (less than 10%) compared to the single-mode architecture. To achieve high data rates in 4G, we present a parallel turbo decoder architecture with scalable parallelism tailored to the given throughput requirements. High-level parallelism is achieved by employing contention-free interleavers. Multi-banked memory structure and routing network among memories and MAP decoders are designed to operate at full speed with parallel interleavers. We designed a very low-complexity recursive on-line address generator supporting multiple interleaving patterns, which avoids the interleaver address memory. Design trade-offs in terms of area and power efficiency are explored to find the optimal architectures. A 711 Mbps data rate is feasible with 32 Radix-4 MAP decoders running at 200 MHz clock rate.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116680236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synthesis of application accelerators on Runtime Reconfigurable Hardware","authors":"M. Alle, Keshavan Varadarajan, R. Ramesh, Joseph Nimmy, Alexander Fell, Adarsha Rao, S. Nandy, R. Narayan","doi":"10.1109/ASAP.2008.4580147","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580147","url":null,"abstract":"Application accelerators are predominantly ASICs. The cost of ASIC solutions is orders of magnitude higher than that of programmable processing cores. Despite this, ASIC solutions are preferred when both high performance and low power are the target. ASICs offer no flexibility in terms of being able to cater to application derivatives, unless this has been provisioned for at the time of design. In this paper we define the architecture of Runtime Reconfigurable Hardware (RRH) as the platform for application acceleration. The proposed RRH is a homogeneous fabric comprising computing, storage and communicating resources. We also propose a synthesis methodology to realize applications written in a high level language (HLL) on the RRH. Applications described in HLL are compiled into application substructures. For each application substructure, a set of Compute Elements (CEs), interconnected in a manner that closely matches the communication pattern within it, is allocated. CEs in such a configuration are called a hardware affine. Hardware affines are carved out on the RRH at runtime. These hardware affines are defined at compile time, and are provisioned at runtime on the fabric. By virtue of the fact that these hardware affines are NOT instruction set processor cores or Logic Elements as in FPGAs, we bear the performance and power advantage of an ASIC, and the hardware reconfigurability/programmability of that of an FPGA/Instruction Set Processor.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"15 7-8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132844933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient digital circuit for implementing Sequence Alignment algorithm in an extended processor","authors":"V. Kundeti, Yunsi Fei, S. Rajasekaran","doi":"10.1109/ASAP.2008.4580171","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580171","url":null,"abstract":"The problem of sequence alignment (Edit Distance) between a pair of strings has been well studied in the field of computing algorithms. The classic dynamic programming-based algorithm, Needleman-Wunsch (O(n^2)), has been widely used in practice, especially by biologists to find similarities between gene sequences. Any optimization in the implementation of this algorithm will have a significant practical impact on biological research. However, within the past several decades, not much has been done in improving the runtime of the algorithm in real implementations. Although algorithms based on systolic processor arrays and FPGAs were presented earlier to create custom hardware to aid in speed-up, their usage has been very limited due to their inherent synchronous design complexity and scalability issues. In view of this, we propose an efficient hardware implementation of the Sequence Alignment algorithm. We provide a simple and efficient asynchronous sequential design which can be readily implemented as an instruction in an extensible processor. Experimental results show that our circuit implementation can achieve a speed-up of 3.77X on average compared with the software counterpart, meanwhile reducing the area cost.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"798 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131600568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lightweight DMA management mechanisms for multiprocessors on FPGA","authors":"Antonino Tumeo, M. Monchiero, G. Palermo, Fabrizio Ferrandi, D. Sciuto","doi":"10.1109/ASAP.2008.4580191","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580191","url":null,"abstract":"This paper presents a multiprocessor system on FPGA that adopts Direct Memory Access (DMA) mechanisms to move data between the external memory and the local memory of each processor. The system integrates all standard DMA primitives via a fast Application Programming Interface (API) and relies on interrupts, with the additional possibility of managing a command list. This interface allows programming the embedded multiprocessor architecture on FPGA with simple DMAs using the same DMA techniques adopted on high performance multiprocessors with complex DMA controllers. Several experiments demonstrate the performance of our solution, allowing 57% improvement on the execution time of a selected set of benchmarks. We furthermore show how some DMA programming techniques (double and multi-buffering) can be effectively used within our platform, thus easing the design and development of the hardware and the software in a reconfigurable DMA-based environment.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"60 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114023770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the high-throughput implementation of RIPEMD-160 hash algorithm","authors":"Miroslav Knezevic, K. Sakiyama, Y. Lee, I. Verbauwhede","doi":"10.1109/ASAP.2008.4580159","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580159","url":null,"abstract":"In this paper we present two new architectures of the RIPEMD-160 hash algorithm for high throughput implementations. The first architecture achieves the iteration bound of RIPEMD-160, i.e. it achieves a theoretical upper bound on throughput at the micro-architecture level. The second architecture is designed by performing a gate level optimization and achieves a better performance than the first one at the cost of a larger gate area. Throughputs of 3.122 Gbps and 624 Mbps are achieved, with and without pipelining, respectively.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124100888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new high-performance scalable dynamic interconnection for FPGA-based reconfigurable systems","authors":"S. Jovanovic, C. Tanougast, S. Weber","doi":"10.1109/ASAP.2008.4580155","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580155","url":null,"abstract":"Networks on chip (NoCs) present viable interconnection architectures which are especially characterized by a high level of parallelism, high performance and scalability. The NoC architectures already proposed in the literature are mostly intended for system-on-chip (SoC) designs. For an FPGA-based reconfigurable system, the proposed NoCs are not suitable. In this paper, we present a new high-performance interconnection approach intended for FPGA-based reconfigurable systems. Our proposed NoC is based on a scalable communication unit characterized by its particular architecture, an arbitration policy based on the priority-to-the-right rule and high performance. We present the basic concept of this communication approach and we demonstrate its feasibility on examples through simulations. Implementation results are also detailed.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129075826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory copies in multi-level memory systems","authors":"P. D. Langen, B. Juurlink","doi":"10.1109/ASAP.2008.4580192","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580192","url":null,"abstract":"Data movement operations, such as the C-style memcpy function, are often used to duplicate or communicate data. This type of function typically produces a significant amount of off-chip traffic. For current microprocessors, communication with off-chip memory is an increasing limitation to attain higher performance as well as a significant source of energy consumption. To decrease the amount of communication between a CPU and the off-chip memory system, we propose a system that implements a hardware memcpy in the memory level where the source data is located.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114572169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Run-time thread sorting to expose data-level parallelism","authors":"Tirath Ramdas, G. Egan, D. Abramson, K. Baldridge","doi":"10.1109/ASAP.2008.4580154","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580154","url":null,"abstract":"We address the problem of data parallel processing for computational quantum chemistry (CQC). CQC is a computationally demanding tool to study the electronic structure of molecules. An important algorithmic component of these computations is the evaluation of Electron Repulsion Integrals (ERIs). A key problem with ERI evaluation is control flow variation between different ERI evaluations, which can only be resolved at runtime. This causes the computation to be unsuitable for data parallel execution. However, it is observed that although there is variation between ERI evaluations, the variation is limited; in fact there are a limited number of ERI classes present within any given workload. Conceptually, it is possible to classify the ERIs into sizable sets, and execute these sets in a data parallel fashion. Practically, creating these sets is computationally expensive. We describe an architecture to perform this thread sorting, where high throughput is achieved with small associative and multiport memories. The performance of the prototype is evaluated with FPGA synthesis. We go on to envision other uses for thread sorting, in general-purpose manycore architectures.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126946796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}