{"title":"Power efficient register file update approach for embedded processors","authors":"R. Ayoub, A. Orailoglu","doi":"10.1109/ICCD.2007.4601935","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601935","url":null,"abstract":"In this paper we present an approach for a low power register file in the domain of embedded processors. The suggested approach obtains power savings through tackling the unnecessary writes to register files for short live registers. Writes to register files are essentially redundant when an instruction manages to forward its results to all of its dependents through forwarding hardware. As the percentage of registers that exhibit short liveness is shown to be significant, tackling unnecessary writes contributes to delivering appreciable power savings. In this work we show that tackling the unnecessary writes could be attained efficiently through a register based encoding scheme. The suggested encoding scheme exploits application-specific information and renames all or most of the short live registers to a small subset of the registers that are prespecified during the hardware design. The renaming process is performed at the compiler level. Power savings can be obtained through precluding the set of prespecified registers from writing to the register file. We suggest in this paper efficient algorithms for the purpose of renaming, one algorithm to perform the renaming in the cases of no register pressure and another one for the cases of register pressure. In the cases of register pressure, some of the prespecified registers may need to be turned into normal registers, a process that is managed through the use of reprogrammable hardware support. Although the cases of register pressure could impact power savings, the detailed analysis we outline shows that the size of the prespecified registers subset is typically small which makes register pressure an infrequent event. Experimental analysis on numerical and DSP codes indicates appreciable improvements in power savings.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"17 1","pages":"431-437"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78386680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Kumary, Partha Kunduz, A.P. Singhx, Li-Shiuan Pehy, N. K. Jhay
{"title":"A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS","authors":"A. Kumary, Partha Kunduz, A.P. Singhx, Li-Shiuan Pehy, N. K. Jhay","doi":"10.1109/ICCD.2007.4601881","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601881","url":null,"abstract":"As chip multiprocessors (CMPs) become the only viable way to scale up and utilize the abundant transistors made available in current microprocessors, the design of on-chip networks is becoming critically important. These networks face unique design constraints and are required to provide extremely fast and high bandwidth communication, yet meet tight power and area budgets. In this paper, we present a detailed design of our on-chip network router targeted at a 36-core shared-memory CMP system in 65 nm technology. Our design targets an aggressive clock frequency of 3.6 GHz, thus posing tough design challenges that led to several unique circuit and microarchitectural innovations and design choices, including a novel high throughput and low latency switch allocation mechanism, a non-speculative single-cycle router pipeline which uses advanced bundles to remove control setup overhead, a low-complexity virtual channel allocator and a dynamically-managed shared buffer design which uses prefetching to minimize critical path delay. Our router takes up 1.19 mm2 area and expends 551 mW power at 10% activity, delivering a single-cycle no-load latency at 3.6 GHz clock frequency while achieving apeak switching data rate in excess of 4.6 Tbits/sper router node.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"163 1","pages":"63-70"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83528352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimized design of a double-precision floating-point multiply-add-dused unit for data dependence","authors":"Gongqiong Li, Zhaolin Li","doi":"10.1109/ICCD.2007.4601918","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601918","url":null,"abstract":"This paper presents a novel double-precision floating-point multiply-add-fused unit, which is implemented in three pipeline stages. The main improvement over the conventional design is data dependence between two consecutive floating-point instructions is considered. In the new design the intermediate computation results of the first floating-point instruction are first pretreated and then fed back to the first stage for being directly used by the second floating-point instruction if the two consecutive floating-point instructions are data dependent. In this way, floating point instructions can be executed directly following their preceding floating-point instructions without being stalled due to data dependence. 11 data dependence cases are accelerated in this paper. The experiments, which are done over four SPEC2000 benchmark programs, show that 25% performance increase can be attained at the cost of 0.27 ns time delay added to the critical path.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"32 1","pages":"311-316"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89976521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Pathak, S. Mukherjee, M. Swaminathan, E. Engin, S. Lim
{"title":"Placement and routing of RF embedded passive designs in LCP substrate","authors":"M. Pathak, S. Mukherjee, M. Swaminathan, E. Engin, S. Lim","doi":"10.1109/ICCD.2007.4601913","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601913","url":null,"abstract":"Physical layout generation of RF embedded passive design is not an easy task since the response of a given layout is tightly coupled with the response of the individual components and the effect of interconnect parasitics. In this paper we propose a methodology for automatic layout generation of embedded passive RF circuits. We make use of circuit models to represent and optimize a given layout and use non-linear optimization at various stages of the methodology to obtain the desired goals. Full-wave EM simulations is completely out of the design loop, so our methodology significantly reduces the design time for RF embedded passive circuits. The proposed approach has been used successfully to generate layout for band-pass filters of varying sizes.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"63 1","pages":"273-279"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77801641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory based computation using embedded cache for processor yield and reliability improvement","authors":"Somnath Paul, S. Bhunia","doi":"10.1109/ICCD.2007.4601922","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601922","url":null,"abstract":"VLSI systems in the nanometer regime suffer from high defect rates and large parametric variations that lead to yield loss as well as reduced reliability of operation. In this paper, we propose a novel memory-based computation framework that exploits on-chip memory for reliable operation by transferring activity from a defective or unreliable functional unit to the embedded memory. This allows the die to run at a reduced performance level instead of being completely discarded or being throttled (in case of variations). We show that the proposed method improves yield and reliability in a superscalar out-of-order processor by tolerating defective functional units and allowing dynamic thermal management. The simulation results show that it entails only a small loss in performance (average 1.8%) at the cost of 9.5% of area overhead required with hardware duplication.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"9 1","pages":"341-346"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73675550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prioritizing verification via value-based correctness criticality","authors":"Joonhyuk Yoo, M. Franklin","doi":"10.1109/ICCD.2007.4601921","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601921","url":null,"abstract":"Microprocessors are becoming increasingly susceptible to soft errors due to the current trends of semiconductor technology scaling. Traditional redundant multi-threading architectures provide good fault tolerance by re-executing all the computations. However, such a full re-execution significantly increases the demand on the processor resources, resulting in severe performance degradation. To address this problem, this paper introduces a correctness criticality based filter checker, which prioritizes the verification candidates so as to selectively do verification. Binary Correctness Criticality (BCC) and Likelihood of Correctness Criticality (LoCC) are metrics that quantify whether an instruction is important for reliability or how likely an instruction is correctness-critical, respectively. A likelihood of correctness criticality is computed by a value vulnerability factor, which is defined by the numerically significant bit-width used to compute a result. The proposed technique is accomplished by exploiting information redundancy of compressing computationally useful data bits. Based on the likelihood of correctness criticality test, the filter checker mitigates the verification workload by bypassing instructions that are unimportant for correct execution. Extensive measurements prove that the LoCC metric yields quite a wide distribution of values, indicating that it has the potential to differentiate diverse degrees of correctness criticality. Experimental results show that the proposed scheme accelerates a traditional fully-fault-tolerant processor by 1.7 times, while it reduces the soft error rate to 18% of that of a non-fault-tolerant processor.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"214 1","pages":"333-340"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79526525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A low overhead hardware technique for software integrity and confidentiality","authors":"Austin Rogers, M. Milenkovic, A. Milenković","doi":"10.1109/ICCD.2007.4601889","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601889","url":null,"abstract":"Software integrity and confidentiality play a central role in making embedded computer systems resilient to various malicious actions, such as software attacks; probing and tampering with buses, memory, and I/O devices; and reverse engineering. In this paper we describe an efficient hardware mechanism that protects software integrity and guarantees software confidentiality. To provide software integrity, each instruction block is signed during program installation with a cryptographically secure signature. The signatures embedded in the code are verified during program execution. Software confidentiality is provided by encrypting instruction blocks. To achieve low performance overhead, the proposed mechanism combines several architectural enhancements: a variation of one-time-pad encryption, parallelizable signatures, and conditional execution of unverified instructions. A relatively high memory overhead due to embedded signatures can be reduced by protecting multiple instruction blocks with one signature, with minimal effects on complexity and performance overhead.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"5 1","pages":"113-120"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77249103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Why we need statistical static timing analysis","authors":"C. Forzan, D. Pandini","doi":"10.1109/ICCD.2007.4601885","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601885","url":null,"abstract":"As technology continues to advance deeper into the nanometer regime, a tight control on the process parameters is increasingly difficult. As a consequence, variability has turned out to be a dominant factor in the design of complex ICs. Traditional static timing analysis (STA) is becoming insufficient to accurately evaluate the process variation impact on the design performance considering the increasing number of process, power supply voltage, and temperature (PVT) corners. In contrast, statistical static timing analysis (SSTA) is a promising innovative technique to handle increasingly larger environmental and process fluctuations, especially on-chip parameter variations. However, the statistical approach needs a set of costly additional data such as an accurate process variation description, and a statistical standard cell library characterization. In this paper, STA and SSTA are applied on a real industrial design to compare the two techniques, in terms of both accuracy and cost. From our analysis, we have concluded that the potential advantages offered by SSTA exceed the additional library characterization cost and process data assembly effort.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"26 1","pages":"91-96"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73196981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CMOS logic design with independent-gate FinFETs","authors":"Anish Muttreja, Niket Agarwal, N. Jha","doi":"10.1109/ICCD.2007.4601953","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601953","url":null,"abstract":"Fin-type field-effect transistors (FinFETs) are promising substitutes for bulk CMOS in nano-scale circuits. In this paper, it is observed that in spite of improved device characteristics, high active leakage may remain a problem for FinFET logic circuits. Leakage is found to contribute 31.3% of total power consumption in power-optimized FinFET logic circuits. Various FinFET logic design styles, based on independent control of FinFET gates, are studied. A new low-leakage logic style is presented. Leakage (total) power savings of 64.7% (14.5%) under tight delay constraints and 91.2% (37.2%) under relaxed delay constraints, through the judicious use of FinFET logic styles, are demonstrated.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"29 1","pages":"560-567"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73739656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Maximizing the throughput-area efficiency of fully-parallel low-density parity-check decoding with C-slow retiming and asynchronous deep pipelining","authors":"Ming Su, Lili Zhou, C. Shi","doi":"10.1109/ICCD.2007.4601964","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601964","url":null,"abstract":"In this paper, we apply C-slow retiming and asynchronous deep pipelining to maximize the throughput-area efficiency of fully parallel low-density-parity-check (LDPC) decoding. Pipelined decoders are implemented in a 0.18 mum FDSOI CMOS process. Experimental results show that our pipelining technique is an efficient approach to maximizing LDPC decoding throughput while minimizing the area consumption. First, pipelined decoders can achieve extraordinary high throughput which non-pipelined design cannot. Second, for the same throughput, pipelined decoders use less area than non-pipelined design. Our approach can improve the throughput of a published implementation by 4 times with only about 80% area overhead. Without using clocks, proposed asynchronous pipelined decoders are more scalable in design complexity and more robust to process-voltage-temperature variations than existing clock-based LDPC decoders.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"10 1","pages":"636-643"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87839480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}