{"title":"Reducing power consumption of embedded processors through register file partitioning and compiler support","authors":"Xuan Guan, Yunsi Fei","doi":"10.1109/ASAP.2008.4580190","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580190","url":null,"abstract":"As embedded processors are widely used in specific application domains, such as communications, multimedia, and networking, the register file accounts for a substantial share of embedded processor energy consumption due to its long working time in data-intensive computations and its large switching capacitance. It is found that 25% of registers can account for 83% of register file access time during the execution of many embedded applications. This fact motivates us to reduce the register file power consumption by partitioning the registers into different regions according to their usage patterns. The most frequently used registers are put in the hot part, and the cold part of the register file is rarely accessed. We employ the register file bitline splitting and the drowsy register cell techniques in our design to reduce the overall access power of the register file. We propose a novel approach to partition the register file so that the largest power saving can be achieved. We formulate the register file partitioning process as a graph partitioning problem, and apply an effective algorithm to obtain the optimal result.
We evaluate our algorithm on MiBench applications, and an average saving of 43.6% in the register file access power consumption over the original non-partitioned register file is achieved for the SimpleScalar PISA system.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124730372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reconfigurable Viterbi decoder on mesh connected multiprocessor architecture","authors":"Ritesh Rajore, Ganesh Garga, H. Jamadagni, S. Nandy","doi":"10.1109/ASAP.2008.4580153","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580153","url":null,"abstract":"In modern wireline and wireless communication systems, the Viterbi decoder is one of the most compute-intensive and essential elements. Each standard requires a different configuration of the Viterbi decoder. Hence there is a need to design a flexible reconfigurable Viterbi decoder to support different configurations on a single platform. In this paper we present a reconfigurable Viterbi decoder which can be reconfigured for standards such as WCDMA, CDMA2000, IEEE 802.11, DAB, DVB, and GSM. Different parameters like code rate, constraint length, polynomials and truncation length can be configured to map any of the above-mentioned standards. Our design provides higher throughput and scalable power consumption in various configurations of the reconfigurable Viterbi decoder. The power and throughput can also be optimized for different standards.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"180 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132612027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An MPSoC architecture for the Multiple Target Tracking application in driver assistant system","authors":"Jehangir Khan, S. Niar, A. Rivenq, Y. Elhillali, J. Dekeyser","doi":"10.1109/ASAP.2008.4580166","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580166","url":null,"abstract":"This article discusses the design of an application-specific MPSoC architecture dedicated to multiple target tracking (MTT). This application has its utility in driver assistant systems, more precisely in collision avoidance and warning systems. An automotive radar is used as the front-end sensor in our application. The article examines the tradeoffs that must be taken into consideration in the realization of the entire MTT application in an embedded system. In our implementation of MTT, several independent parallel tasks have been identified and mapped onto a multiprocessor architecture to meet the deadlines imposed by the application. Our study demonstrates that the joint utilization of reconfigurable circuits (namely FPGA) and MPSoC facilitates the development of a flexible and efficient MTT system.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125684347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Throughput-scalable hybrid-pipeline architecture for multilevel lifting 2-D DWT of JPEG 2000 coder","authors":"B. K. Mohanty, P. Meher","doi":"10.1109/ASAP.2008.4580196","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580196","url":null,"abstract":"In this paper, we propose a pipelined architecture for high-throughput computation of multilevel lifting 2D discrete wavelet transform (DWT). The multilevel DWT computation is shared by the proposed devices based on pyramid algorithm (PA) and recursive pyramid algorithm (RPA), where the PA-based devices compute the lower order subbands and the higher order subbands are computed by an RPA-based device. The hardware- and time-complexities of the proposed structure are compared with those of the existing recursive architectures for performance evaluation. Compared with the best of the existing recursive architectures, the proposed one has nearly 16 times less average computation time (ACT) for the 2D DWT of input size 512 x 512 for S=32, where S is half of the input rate of the structure. Moreover, it involves fewer multipliers and adders than the others when normalized for unit throughput rate. The proposed design offers nearly 100% utilization efficiency for S=32, and 94% efficiency for S=8.
The latency of the structure is very small (on the order of a few cycles), and the structure involves small on-chip storage and fewer data/pipeline registers.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123744542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Managing multi-core soft-error reliability through utility-driven cross domain optimization","authors":"Wangyuan Zhang, Tao Li","doi":"10.1109/ASAP.2008.4580167","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580167","url":null,"abstract":"As semiconductor processing technology continues to scale down, managing reliability becomes an increasingly difficult challenge in high-performance microprocessor design. Transient faults, also known as soft errors, corrupt program data at the circuit level and cause incorrect program execution and system crashes. Future processors will consist of billions of transistors organized as multicore microarchitectures. Packaging multiple cores (and hence more transistors) onto the same die exposes more devices to soft error strikes. This paper explores utility-function-driven (benefit driven) cross domain optimization for both performance and reliability. We propose the use of utility-based resource management for individual cores while applying utility-based shared cache partitioning across multiple cores. Moreover, we coordinate the optimization of multiple resources based on their cross domain utility information to achieve attractive performance and reliability tradeoffs. 
Extensive experimental results show that, on average, our utility-driven cross domain optimization reduces the soft error rate of the most vulnerable core in a chip multiprocessor (CMP) by up to 35% and improves the CMP's overall reliability by 22% with less than 3% performance degradation across 15 investigated workloads.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"262 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122694701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient systolization of cyclic convolution for systolic implementation of sinusoidal transforms","authors":"P. Meher","doi":"10.1109/ASAP.2008.4580161","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580161","url":null,"abstract":"This paper presents an algorithm to convert composite-length cyclic convolution into a block cyclic convolution sum of small matrix-vector products, even if the co-factors of convolution-length are not mutually prime. It is shown that by using optimal short-length convolution algorithms, the block-convolution could be computed from a few short-length cyclic and cyclic-like convolutions, when one of the co-factors belongs to {2, 3, 4, 6, 8}. A generalized systolic array is derived for cyclic-like convolution, and used for the computation of long-length convolutions. The proposed structure for convolution-length N=2L involves nearly the same hardware and half the time-complexity as the direct implementation; and the structure for N=4L involves ≈12.5% more hardware and one-fourth the time-complexity of the latter.
The structures for N=2L and N=4L, respectively, have the same and ≈12.5% less area-time complexity as the corresponding existing prime-factor systolic structures, but unlike the latter type, do not involve complex input/output mapping; and could be used even if the co-factors of convolution-length are not relatively prime.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126085009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA based singular value decomposition for image processing applications","authors":"Masih Rahmaty, Mohammad S. Sadri, Mehdi Ataei Naeini","doi":"10.1109/ASAP.2008.4580176","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580176","url":null,"abstract":"During recent decades, singular value decomposition (SVD) has been widely used in different fields of engineering and science. This makes SVD calculation algorithms and their feasible implementations an attractive area of research. FPGA implementation of SVD has been addressed in some past publications; however, the appearance of new primary elements such as dedicated hardware multipliers, block memories and CPU cores inside new FPGA products, such as Xilinx Virtex-4, has made it possible to use them in more complicated computation tasks.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125226071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multi-FPGA application-specific architecture for accelerating a floating point Fourier Integral Operator","authors":"Jason Lee, Lesley Shannon, M. Yedlin, G. Margrave","doi":"10.1109/ASAP.2008.4580178","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580178","url":null,"abstract":"Many complex systems require the use of floating point arithmetic that is exceedingly time consuming to perform on personal computers. However, floating point operators are also hardware resource intensive and require longer latencies than fixed point operators to complete. Due to the reduced logic density of FPGAs relative to ASICs, it is often only possible to accelerate a portion of a floating point application in hardware. This paper presents an application-specific architecture for the hardware acceleration of a complete Fourier Integral Operator (FIO) kernel used in seismic imaging on a multi-FPGA platform. The design utilizes several floating point computing elements (CEs) to calculate the FIO kernel in parallel stages on multiple FPGAs. A detailed study of floating point CEs, including a Fast Fourier Transform (FFT) CE, and a complete FIO prototype implementation on the BEE2 platform is described. 
The prototype implementation has a 12.4x increase in throughput over an optimized software implementation, and a predicted 15.8x increase in throughput on the BEE3 platform.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126315274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast custom instruction identification by convex subgraph enumeration","authors":"K. Atasu, O. Mencer, W. Luk, C. Özturan, Günhan Dündar","doi":"10.1109/ASAP.2008.4580145","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580145","url":null,"abstract":"Automatic generation of custom instruction processors from high-level application descriptions enables fast design space exploration, while offering very favorable performance and silicon area combinations. This work introduces a novel method for adapting the instruction set to match an application captured in a high-level language. A simplified model is used to find the optimal instructions via enumeration of maximal convex subgraphs of application data flow graphs (DFGs). Our experiments involving a set of multimedia and cryptography benchmarks show that an order of magnitude performance improvement can be achieved using only a limited amount of hardware resources. In most cases, our algorithm takes less than a second to execute.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"181 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123134417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Security processor with quantum key distribution","authors":"T. Lorünser, E. Querasser, T. Matyus, M. Peev, J. Wolkerstorfer, M. Hutter, Alexander Szekely, I. Wimberger, Christian Pfaffel-Janser, A. Neppach","doi":"10.1109/ASAP.2008.4580151","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580151","url":null,"abstract":"We present a fully operable security gateway prototype, integrating quantum key distribution and realised as a system-on-chip. It is implemented on a field-programmable gate array and provides a virtual private network with low latency and gigabit throughput. The seamless hard- and software integration of a quantum key distribution layer enables high key-update rates for the encryption modules. Hence, the amount of data encrypted with one session key can be significantly decreased. We realise a highly modular architecture and make extensive use of software/hardware partitioning. This work is the first approach towards application of a new key distribution technology in dedicated security processors. In particular, it elaborates requirements for the integration of quantum key distribution on a chip level.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130462425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}