2008 International Conference on Application-Specific Systems, Architectures and Processors: Latest Publications

Reducing power consumption of embedded processors through register file partitioning and compiler support

2008 International Conference on Application-Specific Systems, Architectures and Processors Pub Date : 2008-07-02 DOI: 10.1109/ASAP.2008.4580190

Xuan Guan, Yunsi Fei

Abstract: As embedded processors are widely used in specific application domains such as communications, multimedia, and networking, the register file has come to account for a substantial share of embedded processor energy consumption, owing to its long active time during data-intensive computations and its large switching capacitance. It is found that 25% of the registers account for 83% of register file access time during the execution of many embedded applications. This fact motivates us to reduce register file power consumption by partitioning the registers into different regions according to their usage pattern. The most frequently used registers are placed in the hot part, while the cold part of the register file is rarely accessed. We employ register file bitline splitting and drowsy register cell techniques in our design to reduce the overall access power of the register file. We propose a novel approach to partition the register file so that the largest power saving is achieved. We formulate register file partitioning as a graph partitioning problem and apply an effective algorithm to obtain the optimal result. We evaluate our algorithm on MiBench applications and achieve an average saving of 43.6% in register file access power over the original non-partitioned register file for the SimpleScalar PISA system.

Citations: 27
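The Guan and Fei abstract above rests on a usage-frequency observation (a quarter of the registers absorb most accesses). The sketch below shows only that first step under simple assumptions: it ranks registers by profiled access count and puts the top `hot_fraction` (default 0.25, chosen to mirror the abstract) into a hot region. The function name, register names and counts are hypothetical. It is a frequency-only simplification, not the paper's graph-partitioning formulation, and it does not model bitline splitting, drowsy cells or power.

```python
from typing import Dict, List, Tuple


def hot_cold_split(
    access_counts: Dict[str, int], hot_fraction: float = 0.25
) -> Tuple[List[str], List[str], float]:
    """Rank registers by access count and put the top `hot_fraction` in the hot region.

    Returns (hot, cold, hot_share), where hot_share is the fraction of all
    register accesses that hit the hot region.
    """
    # Most frequently accessed registers first (Python's sort is stable for ties).
    ranked = sorted(access_counts, key=lambda r: access_counts[r], reverse=True)
    n_hot = max(1, round(len(ranked) * hot_fraction))
    hot, cold = ranked[:n_hot], ranked[n_hot:]
    total = sum(access_counts.values())
    hot_share = sum(access_counts[r] for r in hot) / total if total else 0.0
    return hot, cold, hot_share


# Hypothetical profile: access counts for eight registers from one run.
profile = {"r1": 50, "r2": 30, "r3": 5, "r4": 5, "r5": 4, "r6": 3, "r7": 2, "r8": 1}
hot, cold, share = hot_cold_split(profile)
print(hot, f"{share:.0%}")  # ['r1', 'r2'] 80%
```

A real partitioner would also weigh the per-access energy of each region and any hardware constraints on region size, which is where a graph formulation such as the paper's becomes necessary.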

Reconfigurable Viterbi decoder on mesh connected multiprocessor architecture

2008 International Conference on Application-Specific Systems, Architectures and Processors Pub Date : 2008-07-02 DOI: 10.1109/ASAP.2008.4580153

Ritesh Rajore, Ganesh Garga, H. Jamadagni, S. Nandy

Citations: 3

An MPSoC architecture for the Multiple Target Tracking application in driver assistant system

2008 International Conference on Application-Specific Systems, Architectures and Processors Pub Date : 2008-07-02 DOI: 10.1109/ASAP.2008.4580166

Jehangir Khan, S. Niar, A. Rivenq, Y. Elhillali, J. Dekeyser

Citations: 29

Throughput-scalable hybrid-pipeline architecture for multilevel lifting 2-D DWT of JPEG 2000 coder

2008 International Conference on Application-Specific Systems, Architectures and Processors Pub Date : 2008-07-02 DOI: 10.1109/ASAP.2008.4580196

B. K. Mohanty, P. Meher

Abstract: In this paper, we propose a pipelined architecture for high-throughput computation of the multilevel lifting 2-D discrete wavelet transform (DWT). The multilevel DWT computation is shared among the proposed devices based on the pyramid algorithm (PA) and the recursive pyramid algorithm (RPA): the PA-based devices compute the lower-order subbands, and the higher-order subbands are computed by an RPA-based device. For performance evaluation, the hardware and time complexities of the proposed structure are compared with those of existing recursive architectures. Compared with the best existing recursive architecture, the proposed one has nearly 16 times lower average computation time (ACT) for a 2-D DWT of input size 512 × 512 with S = 32, where S is half the input rate of the structure. Moreover, it involves fewer multipliers and adders than the others when normalized to unit throughput rate. The proposed design offers nearly 100% utilization efficiency for S = 32 and 94% efficiency for S = 8. The latency of the structure is very small (on the order of a few cycles), and it requires little on-chip storage and fewer data/pipeline registers.

Citations: 10

Managing multi-core soft-error reliability through utility-driven cross domain optimization

2008 International Conference on Application-Specific Systems, Architectures and Processors Pub Date : 2008-07-02 DOI: 10.1109/ASAP.2008.4580167

Wangyuan Zhang, Tao Li

Abstract: As semiconductor process technology continues to scale down, managing reliability becomes an increasingly difficult challenge in high-performance microprocessor design. Transient faults, also known as soft errors, corrupt program data at the circuit level and cause incorrect program execution and system crashes. Future processors will consist of billions of transistors organized as multicore microarchitectures. Packaging multiple cores (and hence more transistors) onto the same die exposes more devices to soft-error strikes. This paper explores utility-function-driven (benefit-driven) cross-domain optimization for both performance and reliability. We propose utility-based resource management for individual cores while applying utility-based shared cache partitioning across multiple cores. Moreover, we coordinate the optimization of multiple resources based on their cross-domain utility information to achieve attractive performance and reliability tradeoffs. Extensive experimental results show that, on average, our utility-driven cross-domain optimization reduces the soft error rate of the most vulnerable core in a chip multiprocessor (CMP) by up to 35% and improves the CMP's overall reliability by 22% with less than 3% performance degradation across 15 investigated workloads.

Citations: 14

Efficient systolization of cyclic convolution for systolic implementation of sinusoidal transforms

2008 International Conference on Application-Specific Systems, Architectures and Processors Pub Date : 2008-07-02 DOI: 10.1109/ASAP.2008.4580161

P. Meher

Abstract: This paper presents an algorithm to convert a composite-length cyclic convolution into a block cyclic convolution sum of small matrix-vector products, even if the co-factors of the convolution length are not mutually prime. It is shown that, by using optimal short-length convolution algorithms, the block convolution can be computed from a few short-length cyclic and cyclic-like convolutions when one of the co-factors belongs to {2, 3, 4, 6, 8}. A generalized systolic array is derived for cyclic-like convolution and used for the computation of long-length convolutions. The proposed structure for convolution length N = 2L involves nearly the same hardware as, and half the time complexity of, the direct implementation; the structure for N = 4L involves about 12.5% more hardware and one-fourth the time complexity of the latter. The structures for N = 2L and N = 4L have, respectively, the same area-time complexity as and about 12.5% less area-time complexity than the corresponding existing prime-factor systolic structures but, unlike the latter, do not involve complex input/output mapping, and can be used even if the co-factors of the convolution length are not relatively prime.

Citations: 7
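The Meher abstract above rewrites a long cyclic convolution as a short block cyclic convolution whose entries are small matrix-vector products. The sketch below shows that idea for the co-factor 2 only, in software rather than as a systolic array, and is not the paper's derivation. Cyclic convolution y = C x uses the circulant matrix C[n][k] = h[(n - k) mod N]; splitting C into L × L blocks gives blocks that depend only on (p - q) mod 2, namely B_r[i][j] = h[(rL + i - j) mod N]. Hence y0 = B0 x0 + B1 x1 and y1 = B1 x0 + B0 x1, and the standard length-2 fast algorithm (y0 + y1 = (B0 + B1)(x0 + x1), y0 - y1 = (B0 - B1)(x0 - x1)) needs two block products instead of four. Function names are hypothetical, and integer (fixed-point) samples are assumed so the final halving is exact.

```python
from typing import List


def cyclic_conv_direct(x: List[int], h: List[int]) -> List[int]:
    """Reference: y[n] = sum_k x[k] * h[(n - k) mod N]."""
    N = len(x)
    return [sum(x[k] * h[(n - k) % N] for k in range(N)) for n in range(N)]


def block(h: List[int], r: int, L: int) -> List[List[int]]:
    """L x L block B_r[i][j] = h[(r*L + i - j) mod N] of the circulant matrix."""
    N = len(h)
    return [[h[(r * L + i - j) % N] for j in range(L)] for i in range(L)]


def matvec(A: List[List[int]], v: List[int]) -> List[int]:
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def cyclic_conv_2L(x: List[int], h: List[int]) -> List[int]:
    """Length-N cyclic convolution (N = 2L) via a length-2 block cyclic convolution."""
    N = len(x)
    assert N == len(h) and N % 2 == 0
    L = N // 2
    x0, x1 = x[:L], x[L:]
    B0, B1 = block(h, 0, L), block(h, 1, L)
    b_sum = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(B0, B1)]
    b_diff = [[a - b for a, b in zip(r0, r1)] for r0, r1 in zip(B0, B1)]
    # Two L x L products replace the four (B0x0, B1x1, B1x0, B0x1) of the naive block form.
    m0 = matvec(b_sum, [a + b for a, b in zip(x0, x1)])   # = y0 + y1
    m1 = matvec(b_diff, [a - b for a, b in zip(x0, x1)])  # = y0 - y1
    # m0 + m1 = 2*y0 and m0 - m1 = 2*y1 exactly, so integer halving loses nothing.
    y0 = [(a + b) // 2 for a, b in zip(m0, m1)]
    y1 = [(a - b) // 2 for a, b in zip(m0, m1)]
    return y0 + y1


x = [1, 2, 3, 4, 5, 6]
h = [2, -1, 0, 3, 1, 4]
assert cyclic_conv_2L(x, h) == cyclic_conv_direct(x, h)
```

Counting only scalar multiplications, the block form uses 2L² = N²/2 against N² for the direct form; the pre- and post-additions are the price paid for that saving.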

FPGA based singular value decomposition for image processing applications

2008 International Conference on Application-Specific Systems, Architectures and Processors Pub Date : 2008-07-02 DOI: 10.1109/ASAP.2008.4580176

Masih Rahmaty, Mohammad S. Sadri, Mehdi Ataei Naeini

Citations: 23

A multi-FPGA application-specific architecture for accelerating a floating point Fourier Integral Operator

2008 International Conference on Application-Specific Systems, Architectures and Processors Pub Date : 2008-07-02 DOI: 10.1109/ASAP.2008.4580178

Jason Lee, Lesley Shannon, M. Yedlin, G. Margrave

Abstract: Many complex systems require floating-point arithmetic, which is exceedingly time-consuming to perform on personal computers. However, floating-point operators are also hardware-resource intensive and take longer to complete than fixed-point operators. Because FPGAs have lower logic density than ASICs, it is often possible to accelerate only a portion of a floating-point application in hardware. This paper presents an application-specific architecture for the hardware acceleration of a complete Fourier Integral Operator (FIO) kernel, used in seismic imaging, on a multi-FPGA platform. The design uses several floating-point computing elements (CEs) to calculate the FIO kernel in parallel stages on multiple FPGAs. A detailed study of floating-point CEs, including a Fast Fourier Transform (FFT) CE, and a complete FIO prototype implementation on the BEE2 platform are described. The prototype implementation achieves a 12.4x increase in throughput over an optimized software implementation, and a predicted 15.8x increase in throughput on the BEE3 platform.

Citations: 5

Fast custom instruction identification by convex subgraph enumeration

2008 International Conference on Application-Specific Systems, Architectures and Processors Pub Date : 2008-07-02 DOI: 10.1109/ASAP.2008.4580145

K. Atasu, O. Mencer, W. Luk, C. Özturan, Günhan Dündar

Citations: 48

Security processor with quantum key distribution

2008 International Conference on Application-Specific Systems, Architectures and Processors Pub Date : 2008-07-02 DOI: 10.1109/ASAP.2008.4580151

T. Lorünser, E. Querasser, T. Matyus, M. Peev, J. Wolkerstorfer, M. Hutter, Alexander Szekely, I. Wimberger, Christian Pfaffel-Janser, A. Neppach

Citations: 6