2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)最新文献_第4页

Noxim: An open, extensible and cycle-accurate network on chip simulator 一个开放的，可扩展的，周期精确的网络芯片模拟器

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP) Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245728

V. Catania, Andrea Mineo, Salvatore Monteleone, M. Palesi, Davide Patti

引用次数: 236

Programmable RNS lattice-based parallel cryptographic decryption 基于可编程RNS格的并行密码解密

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP) Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245723

P. Martins, L. Sousa, J. Eynard, J. Bajard

引用次数: 8

Stochastic circuit design and performance evaluation of vector quantization 随机电路设计及矢量量化性能评价

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP) Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245717

Ran Wang, Jie Han, B. Cockburn, D. Elliott

{"title":"Stochastic circuit design and performance evaluation of vector quantization","authors":"Ran Wang, Jie Han, B. Cockburn, D. Elliott","doi":"10.1109/ASAP.2015.7245717","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245717","url":null,"abstract":"Vector quantization (VQ) is a general data compression technique that has a scalable implementation complexity and potentially a high compression ratio. In this paper, a novel implementation of VQ using stochastic circuits is proposed and its performance is evaluated. The stochastic and binary designs are compared for the same compression quality and the circuits are synthesized for an industrial 28-nm cell library. The effects of varying the sequence length of the stochastic design are studied with respect to the performance metric of throughput per area (TPA). When a shortened 512-bit encoding sequence is used to obtain a lower quality compression, the TPA is about 2.60 times that of the binary implementation with the same quality as that of the stochastic implementation measured by the L1 norm error (i.e., the first-order error). Thus, the stochastic implementation outperforms the conventional binary design in terms of TPA for a relatively low compression quality. By exploiting the progressive precision feature of a stochastic circuit, a readily scalable processing quality can be attained by simply halting the computation after different numbers of clock cycles.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"38 1","pages":"111-115"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74640030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 8

An IEEE 754 double-precision floating-point multiplier for denormalized and normalized floating-point numbers 用于非规范化和规范化浮点数的ieee754双精度浮点乘法器

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP) Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245706

S. Thompson, J. Stine

引用次数: 7

Large-scale packet classification on FPGA 基于FPGA的大规模分组分类

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP) Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245738

Shijie Zhou, Yun Qu, V. Prasanna

引用次数: 11

LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs LightSpMV:在支持cuda的gpu上更快的基于csr的稀疏矩阵向量乘法

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP) Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245713

Yongchao Liu, B. Schmidt

{"title":"LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs","authors":"Yongchao Liu, B. Schmidt","doi":"10.1109/ASAP.2015.7245713","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245713","url":null,"abstract":"Compressed sparse row (CSR) is a frequently used format for sparse matrix storage. However, the state-of-the-art CSR-based sparse matrix-vector multiplication (SpMV) implementations on CUDA-enabled GPUs do not exhibit very high efficiency. This has motivated the development of some alternative storage formats for GPU computing. Unfortunately, these alternatives are incompatible with most CPU-centric programs and require dynamic conversion from CSR at runtime, thus incurring significant computational and storage overheads. We present LightSpMV, a novel CUDA-compatible SpMV algorithm using the standard CSR format, which achieves high speed by benefiting from the fine-grained dynamic distribution of matrix rows over warps/vectors. In LightSpMV, two dynamic row distribution approaches have been investigated at the vector and warp levels with atomic operations and warp shuffle functions as the fundamental building blocks. We have evaluated LightSpMV using various sparse matrices and further compared it to the CSR-based SpMV subprograms in the state-of-the-art CUSP and cuSPARSE libraries. Performance evaluation reveals that on the same Tesla K40c GPU, LightSpMV is superior to both CUSP and cuSPARSE, with a speedup of up to 2.60 and 2.63 over CUSP, and up to 1.93 and 1.79 over cuSPARSE for single and double precision, respectively. LightSpMV is available at http://lightspmv.sourceforge.net.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"12 1","pages":"82-89"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82739057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 41

How can Garbage Collection be energy efficient by dynamic offloading? 垃圾收集如何通过动态卸载实现节能?

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP) Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245725

Jie Tang, Chen Liu, J. Gaudiot

引用次数: 0

An efficient architecture solution for low-power real-time background subtraction 一种低功耗实时背景减法的高效架构解决方案

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP) Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245737

H. Tabkhi, Majid Sabbagh, G. Schirner

{"title":"An efficient architecture solution for low-power real-time background subtraction","authors":"H. Tabkhi, Majid Sabbagh, G. Schirner","doi":"10.1109/ASAP.2015.7245737","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245737","url":null,"abstract":"Embedded vision is a rapidly growing market with a host of challenging algorithms. Among vision algorithms, Mixture of Gaussian (MoG) background subtraction is a frequently used kernel involving massive computation and communication. Tremendous challenges need to be reolved to provide MoG's high computation and communication demands with minimal power consumption allowing its embedded deployment. This paper proposes a customized architecture for power-efficient realization of MoG background subtraction operating at Full-HD resolution. Our design process benefits from system-level design principles. An SLDL-captured specification (result of high-level explorations) serves as a specification for architecture realization and hand-crafted RTL design. To optimize the architecture, this paper employs a set of optimization techniques including parallelism extraction, algorithm tuning, operation width sizing and deep pipelining. The final MoG implementation consists of 77 pipeline stages operating at 148.5 MHz implemented on a Zynq-7000 SoC. Furthermore, our background subtraction solution is flexible allowing end users to adjust algorithm parameters according to scene complexity. Our results demonstrate a very high efficiency for both indoor and outdoor scenes with 145 mW on-chip power consumption and more than 600× speedup over software execution on ARM Cortex A9 core.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"13 1","pages":"218-225"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88200441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Mixed-signal implementation of differential decoding using binary message passing algorithms 用二进制消息传递算法实现差分解码的混合信号

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP) Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245718

G. Cowan, Kevin Cushon, W. Gross

引用次数: 1

On-demand fault-tolerant loop processing on massively parallel processor arrays 大规模并行处理器阵列上的按需容错循环处理

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP) Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245734

Alexandru Tanase, Michael Witterauf, J. Teich, Frank Hannig, Vahid Lari

{"title":"On-demand fault-tolerant loop processing on massively parallel processor arrays","authors":"Alexandru Tanase, Michael Witterauf, J. Teich, Frank Hannig, Vahid Lari","doi":"10.1109/ASAP.2015.7245734","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245734","url":null,"abstract":"We present a compilation-based technique for providing on-demand structural redundancy for massively parallel processor arrays. Thereby, application programmers gain the capability to trade throughput for reliability according to application requirements. To protect parallel loop computations against errors, we propose to apply the well-known fault tolerance schemes dual modular redundancy (DMR) and triple modular redundancy (TMR) to a whole region of the processor array rather than individual processing elements. At the source code level, the compiler realizes these replication schemes with a program transformation that: (1) replicates a parallel loop program two or three times for DMR or TMR, respectively, and (2) introduces appropriate voting operations whose frequency and location may be chosen from three proposed variants. Which variant to choose depends, for example, on the error resilience needs of the application or the expected soft error rates. Finally, we explore the different tradeoffs of these variants in terms of performance overheads and error detection latency.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"70 1","pages":"194-201"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90563008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 7