{"title":"Twiddle factor transformation for pipelined FFT processing","authors":"I. Park, WonHee Son, Ji-Hoon Kim","doi":"10.1109/ICCD.2007.4601872","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601872","url":null,"abstract":"This paper presents a novel transformation technique that can derive various fast Fourier transform (FFT) in a unified paradigm. The proposed algorithm is to find a common twiddle factor at the input side of a butterfly and migrate it to the output side. Starting from the radix-2 FFT algorithm, the proposed common factor migration technique can generate most of previous FFT algorithms without using mathematical manipulation. In addition, we propose new FFT algorithms derived by applying the proposed twiddle factor moving technique, which reduce the number of twiddle factors significantly compared with the previous algorithms being widely used for pipelined FFT processing.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"13 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84203304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel profile-driven technique for simultaneous power and code-size optimization of microcoded IPs","authors":"B. Gorjiara, D. Gajski","doi":"10.1109/ICCD.2007.4601960","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601960","url":null,"abstract":"Microcoded customized IPs have significantly better performance, yet larger code size, compared to similarly-sized instruction-based processors. Storing wide microcodes on-chip requires wide memory-blocks that occupy a large area and consume high leakage power. Therefore, addressing the code size of microcoded IPs is very important. In this paper, we introduce compression techniques that along with careful resolution of “don't care” values (denoted by 'X') in microcode can address the code size issue. We observed that 'X' values can be used for improving either dynamic power of IPs or their compression. However, achieving the efficiency of both is challenging. In this paper, we propose a profile-guided 'X'-resolution technique that can achieve both power and compression efficiency. Using our technique, the code size of microcoded IPs is reduced by 2.7 times, while saving 20% dynamic power, on average.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"68 1","pages":"609-614"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84101842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving the reliability of on-chip L2 cache using redundancy","authors":"K. Bhattacharya, Soontae Kim, N. Ranganathan","doi":"10.1109/ICCD.2007.4601906","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601906","url":null,"abstract":"The reliability of large on-chip L2 cache poses a significant challenge due to technology scaling trends. As the minimum feature size continues to decrease, the L2 caches become more vulnerable to multi-bit soft errors. Traditionally, L2 caches have been protected from multi-bit soft errors using techniques like using error detection/correction codes or employing physical interleaving of cache bit lines to convert multi-bit errors into single-bit errors. These methods, however, incur large overheads in area and power. In this work, we investigate several new techniques for reducing multi-bit errors in large L2 caches, in which the multi-bit errors are detected using simple error detection codes and corrected using the data redundancy in the memory hierarchy. Further, we develop a reliability aware replacement policy that dynamically trades performance for reliability whenever the soft-error budget is exceeded. In order to further improve reliability, we propose the duplication of the data values in cache lines by exploiting their small data widths. The proposed techniques were implemented in the Simplescalar framework and validated using the SPEC 2000 integer and floating point benchmarks. 
The proposed techniques improve the reliability of L2 caches by 40% and 32% on the average, for integer and floating point applications respectively, with little impact on performance and area.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"7 1","pages":"224-229"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83646846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Contention-free switch-based implementation of 1024-point Radix-2 Fourier Transform Engine","authors":"H. Saleh, B. Mohd, A. Aziz, E. Swartzlander","doi":"10.1109/ICCD.2007.4601873","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601873","url":null,"abstract":"This paper examines the use of a switch based architecture to implement a Radix-2 decimation in frequency fast Fourier transform engine. The architecture interconnects M processing elements with 2*M memories. An algorithm to detect and resolve memory access contention is presented. The implementation of 1024-point FFTs with 2 processing elements is discussed in detail, including timing and place-and-route results. The switch based architecture provides a factor of M speedup over a single processing element realization.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"35 1","pages":"7-12"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75601104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving the reliability of on-chip data caches under process variations","authors":"Wei Wu, S. Tan, Jun Yang, Shih-Lien Lu","doi":"10.1109/ICCD.2007.4601920","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601920","url":null,"abstract":"On-chip caches take a large portion of the chip area. They are much more vulnerable to parameter variation than smaller units. As leakage current becomes a significant component of the total power consumption, the thermal and reliability problems that leakage current variations induce in on-chip caches become an important design concern. This paper studies the impact of process variations, particularly the leakage variations, on the temperature and reliability of on-chip caches. Our statistical simulation shows that, under process variation, 85% of the caches see shortened lifetime, with average lifetime being 81.6% of the ideal cache. At runtime, unevenly distributed dynamic power and the corresponding thermal variation would further deteriorate the situation. To mitigate this problem, we propose a dynamic cache subarray permutation scheme that can alleviate the thermal stress on a high-leakage area to improve the reliability of the caches. 
Experiments on 17 Spec2k benchmarks show that our scheme can extend the cache lifetime by up to 20.3%, and reduce the peak temperature by 7 degrees on average and more on data-intensive applications.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"13 1","pages":"325-332"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72782691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Limits on voltage scaling for caches utilizing fault tolerant techniques","authors":"Avesta Sasan, A. Djahromi, A. Eltawil, F. Kurdahi","doi":"10.1109/ICCD.2007.4601943","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601943","url":null,"abstract":"This paper proposes a new low power cache architecture that utilizes fault tolerance to allow aggressively reduced voltage levels. The fault tolerant overhead circuits consume little energy, but enable the system to operate correctly and boost the system performance to close to defect free operation. Overall, power savings of over 40% are reported on standard benchmarks.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"92 1","pages":"488-495"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74325202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring DRAM cache architectures for CMP server platforms","authors":"Li Zhao, R. Iyer, R. Illikkal, D. Newell","doi":"10.1109/ICCD.2007.4601880","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601880","url":null,"abstract":"As dual-core and quad-core processors arrive in the marketplace, the momentum behind CMP architectures continues to grow strong. As more and more cores/threads are placed on-die, the pressure on the memory subsystem is rapidly increasing. To address this issue, we explore DRAM cache architectures for CMP platforms. In this paper, we investigate the impact of introducing a low latency, large capacity and high bandwidth DRAM-based cache between the last level SRAM cache and memory subsystem. We first show the potential benefits of large DRAM caches for key commercial server workloads. As the primary hurdle to achieving these benefits with DRAM caches is the tag space overheads associated with them, we identify the most efficient DRAM cache organization and investigate various options. Our results show that the combination of 8-bit partial tags and 2-way sectoring achieves the highest performance (20% to 70%) with the lowest tag space (<25%) overhead.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"74 1","pages":"55-62"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77376986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault-based alternate test of RF components","authors":"S. S. Akbay, A. Chatterjee","doi":"10.1109/ICCD.2007.4601947","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601947","url":null,"abstract":"Defect-based RF testing is a strong candidate for providing the best solution in terms of ATE complexity and cost. However, specification-based testing is still the norm for analog/RF because of the limitations of analog fault models. Unfortunately, as the amount of functionality packed into individual devices is increased with each generation, the cost of testing larger numbers of specifications also increases. To address this, the alternate test methodology proposed in the past, which significantly cuts costs associated with specification tests by crafting a single test stimulus and mapping the response signatures into all specifications at once, can be modified for defect-based testing as well. In this work, we explore a new type of alternate test that is more fundamental than defect-based or specification-based approaches. 
Rather than focusing on physical defect mechanisms or the way individual specifications are measured, fault-based alternate test studies the abstractions of physical phenomena that cause specification violations; it unifies the benefits of reduced ATE complexity of defect-based approaches and the compact stimulus-signature pairs of specification-based alternate tests.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"1 1","pages":"518-525"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81569939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application of symbolic computer algebra to arithmetic circuit verification","authors":"Yuki Watanabe, N. Homma, T. Aoki, T. Higuchi","doi":"10.1109/ICCD.2007.4601876","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601876","url":null,"abstract":"This paper presents a formal approach to verify arithmetic circuits using symbolic computer algebra. Our method describes arithmetic circuits directly with high-level mathematical objects based on weighted number systems and arithmetic formulae. Such circuit description can be effectively verified by polynomial reduction techniques using Gröbner bases. In this paper, we describe how the symbolic computer algebra can be used to describe and verify arithmetic circuits. The advantageous effects of the proposed approach are demonstrated through experimental verification of some arithmetic circuits such as multiply-accumulator and FIR filter. The result shows that the proposed approach has a definite possibility of verifying practical arithmetic circuits where the conventional techniques failed.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"32 1","pages":"25-32"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86208841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Statistical simulation of chip multiprocessors running multi-program workloads","authors":"Davy Genbrugge, L. Eeckhout","doi":"10.1109/ICCD.2007.4601940","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601940","url":null,"abstract":"This paper explores statistical simulation as a fast simulation technique for driving chip multiprocessor (CMP) design space exploration. The idea of statistical simulation is to measure a number of important program execution characteristics, generate a synthetic trace, and simulate that synthetic trace. The important benefit is that a synthetic trace is very small compared to real program traces. This paper advances statistical simulation by modeling shared resources, such as shared caches and off-chip bandwidth. This is done (i) by collecting cache set access probabilities and per-set LRU stack depth profiles, and (ii) by modeling a program's time-varying execution behavior in the synthetic trace. The key benefit is that the statistical profile is independent of a given cache configuration and the amount of multiprocessing, which enables statistical simulation to model conflict behavior in shared caches when multiple programs are co-executing on a CMP. We demonstrate that statistical simulation is both accurate and fast with average IPC prediction errors of less than 5.5% and simulation speedups of 40X to 70X compared to the detailed simulation of 100M-instruction traces. 
This makes statistical simulation a viable tool for CMP design space exploration.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"26 1","pages":"464-471"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87554741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}