{"title":"Negative-skewed shadow registers for at-speed delay variation characterization","authors":"Jie Li, J. Lach","doi":"10.1109/ICCD.2007.4601924","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601924","url":null,"abstract":"The increased process, voltage, and temperature (PVT) variability that comes with integrated circuit (IC) technology scaling has become a major problem in the semiconductor industry. In order to refine manufacturing processes and develop circuit design techniques to cope with variability, we must be able to accurately and precisely characterize the variations that occur. In this paper, we introduce a technique for characterizing combinational path delay variations by measuring a designer-controlled number of register-to-register delays in manufactured ICs with negative-skewed shadow registers. This technique enables delay measurements to be performed with at-speed tests that are run in parallel with and are orthogonal to other testing techniques, and therefore does not add combinatorial complexity to the testing process. This technique can be implemented cost-effectively on a large number of otherwise unobservable internal combinational paths to get accurate, precise data about delay variability.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"32 1","pages":"354-359"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88482713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A radix-10 SRT divider based on alternative BCD codings","authors":"Álvaro Vázquez, E. Antelo, P. Montuschi","doi":"10.1109/ICCD.2007.4601914","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601914","url":null,"abstract":"In this paper we present the algorithm and architecture a radix-10 floating-point divider based on an SRT non-restoring digit-by-digit algorithm. The algorithm uses conventional techniques developed to speed-up radix-2k division such as signed-digit (SD) redundant quotient and digit selection by constant comparison using a carry-save estimate of the partial remainder. To optimize area and latency for decimal, we include novel features such as the use of alternative BCD codings to represent decimal operands, estimates by truncation at any binary position inside a decimal digit, a single customized fast carry propagate decimal adder for partial remainder computation, initial odd multiple generation and final normalization with rounding, and register placement to exploit advanced high fanin mux-latch circuits. The rough area-delay estimations performed show that the proposed divider has a similar latency but less hardware complexity (1.3 area ratio) than a recently published high performance digit-by-digit implementation.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"69 1","pages":"280-287"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91176668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAP: Criticality analysis for power-efficient speculative multithreading","authors":"James Tuck, Wei Liu, J. Torrellas","doi":"10.1109/ICCD.2007.4601932","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601932","url":null,"abstract":"While speculative multithreading (SM) on a chip multiprocessor (CMP) has the ability to speed-up hard-to- parallelize applications, the power inefficiency of aggressive speculation is a concern. To improve SMs power effeciency, we note that not all the tasks that are running in a SM environment are equally critical. To leverage this insight, this paper develops a novel, widely-applicable task-criticality model for SM. It also proposes CAP, a novel architecture that builds a task-criticality graph dynamically and uses it to make scheduling decisions in a SM CMP. Experiments with SPECint, SPECfp, and Olden applications show that, in a CMP with one fast core and three slow ones, the E D2 with CAP is, on average, 91-95% of that without. Moreover, it is only 77-91% of the E D2 of a CMP with four fast cores and no CAP. Overall, we argue that scheduling for task criticality is beneficial.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"41 1","pages":"409-416"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73711647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power-aware mapping for reconfigurable NoC architectures","authors":"M. Modarressi, H. Sarbazi-Azad","doi":"10.1109/ICCD.2007.4601933","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601933","url":null,"abstract":"A core mapping method for reconfigurable network-on-chip (NoC) architectures is presented in this paper. In most of the existing methods, mapping is carried out based on the traffic characteristics of a single application. However, several different applications are implemented and integrated in the modern complex system-on-chips which should be considered by mapping methods. In the proposed method, the reconfiguration (which is achieved by embedding programmable switches between routers of a mesh-based NoC) allows us to dynamically change the network topology in order to adapt it with the running application and optimize the power and performance metrics. The presented network architecture can be configured as an application- specific topology, while it still holds the benefits of the regular NoC topologies such as modularity and predictable electrical properties. The experimental results show that this method can effectively adapt the NoC to the running application and improve the power consumption and performance of the system.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"1 1","pages":"417-422"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79920797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hybrid resistor/FET-logic demultiplexer architecture design for hybrid CMOS/nanodevice circuits","authors":"Shu Li, Tong Zhang","doi":"10.1109/ICCD.2007.4601955","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601955","url":null,"abstract":"Hybrid nanoelectronics are emerging as one viable option to sustain the Moorepsilas Law after the CMOS scaling limit is reached. One main design challenge in hybrid nanoelectronics is the interface (named as demux) between the highly dense nanowires in nanodevice crossbars and relatively coarse microwires in CMOS domain. The prior work on demux design use a single type of devices to realize the demultiplexing function, but hardly provides a satisfactory solution. This work proposes to combine resistor with FET to implement the demux, leading to the so-called hybrid resistor/FET-logic demux. Such hybrid demux architecture can make these two types of devices well complement each other to improve the overall demux design effectiveness. Furthermore, the effects of resistor conductance variability are analyzed and evaluated based on computer simulations.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"27 1","pages":"574-579"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83690749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving cache efficiency via resizing + remapping","authors":"Subramanian Ramaswamy, S. Yalamanchili","doi":"10.1109/ICCD.2007.4601879","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601879","url":null,"abstract":"In this paper we propose techniques to dynamically downsize or upsize a cache accompanied by cache set/line shutdown to produce efficient caches. Unlike previous approaches, resizing is accompanied by a non-uniform remapping of memory into the resized cache, thus avoiding misses to sets/lines that are shut off. The paper first provides an analysis into the causes of energy inefficiencies revealing a simple model for improving efficiency. Based on this model we propose the concept of \"folding\" - memory regions mapping to disjoint cache resources are combined to share cache sets producing a new placement function. Folding enables powering down cache sets at the expense of possibly increasing conflict misses. Effective folding heuristics can substantially increase energy efficiency at the expense of acceptable increase in execution time. We target the 12 cache because of its larger size and greater energy consumption. Our techniques increase cache energy efficiency by 20%, and reduce the EDP (energy delay product) by up to 45% with an IPC degradation of less than 4%. The results also indicate opportunity for improving cache efficiencies further via cooperative compiler interactions.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"12 1","pages":"47-54"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83623566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed voting for fault-tolerant nanoscale systems","authors":"A. Namazi, M. Nourani","doi":"10.1109/ICCD.2007.4601954","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601954","url":null,"abstract":"In this paper, we propose a distributed voting strategy to design a robust NMR system. We show that using inexpensive current-based drivers and buffers, we can completely eliminate the centralized voter unit and do the majority voting among N modules in a distributed fashion. Our strategy achieves high reliability that is vital for future nano systems in which high defect rate is expected. Experimental results are also reported to verify the concept, clarify the design procedure and measure the system's reliability.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"1 1","pages":"568-573"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86418302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing leakage power in peripheral circuits of L2 caches","authors":"H. Homayoun, A. Veidenbaum","doi":"10.1109/ICCD.2007.4601907","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601907","url":null,"abstract":"Leakage power has grown significantly and is a major challenge in microprocessor design. Leakage is the dominant power component in second-level (L2) caches. This paper presents two architectural techniques to utilize leakage reduction circuits in L2 caches. They primarily target the leakage in the peripheral circuitry of an L2 cache and as such have to be able to cope with longer delays. One technique exploits the fact that processor activity decreases significantly after an L2 cache miss occurs and saves power during L2 miss service time. Two algorithms, a static one and an adaptive one, are proposed for deciding when to apply this leakage reduction technique. Another technique attempts to keep the peripheral circuits in a lower-power state most of the time. The results for SPEC2K benchmarks show that the first technique can achieve a 18 to 22% reduction in L2 power consumption, on average (and up to 63%), depending on the decision algorithm. The second technique can save 25%, on average (and up to 80%). This comes with a negligible 1 to 2% performance impact, on average, depending on the technique used.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"5 1","pages":"230-237"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90554162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. H. Gholamipour, E. Bozorgzadeh, Sudarshan Banerjee
{"title":"Energy-aware co-processor selection for embedded processors on FPGAs","authors":"A. H. Gholamipour, E. Bozorgzadeh, Sudarshan Banerjee","doi":"10.1109/ICCD.2007.4601895","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601895","url":null,"abstract":"In this paper, we present co-processor selection problem for minimum energy consumption in hw/sw co-design on FPGAs with dual power mode. We provide theoretical analysis for the problem under no constraint, resource constraint, and timing constraint. We prove that the complexity of the problem in each case is NP-Hard and we provide a generalized ILP formulation. We compared the result of our approach in minimizing energy to the result of other approaches that had not considered both static and dynamic power during optimization and we showed that we can reduce energy by 63% in some cases.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"413 1","pages":"158-163"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79214170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-level ata prefetching","authors":"Fei Gao, Hanyu Cui, S. Sair","doi":"10.1109/ICCD.2007.4601908","DOIUrl":"https://doi.org/10.1109/ICCD.2007.4601908","url":null,"abstract":"Data prefetching has been shown to be an effective tool in hiding part of the latency associated with cache misses in modern processors. Traditionally, data prefetchers fetch data into a small prefetch buffer near the LI for low latency, or the L2 cache for greater coverage and less cache pollution. However, with the L1-L2 cache speed gap growing, significant performance gains can be obtained if the data pref etcher can operate as aggressively as an L2-level pref etcher but with the fast hit times of an LI-level pref etcher. In this paper, we propose a prefetching framework where an LI-level prefetcher and an L2- level prefetcher work cooperatively to reduce the average access time more than either one alone can. We evaluate several design alternatives suited to perform synergistically under different workloads. From the insight we gather from this analysis, we propose a confidence-based adaptive prefetcher that can improve prefetch efficiency significantly with judicious use of available bus bandwidth. Our results show that for certain prefetcher combinations, two- level prefetching can achieve the cumulative speedup attained from either prefetcher alone. Furthermore, when compared to other two-level prefetching models, the adaptive design provides similar speedups with appreciably less bus traffic.","PeriodicalId":6306,"journal":{"name":"2007 25th International Conference on Computer Design","volume":"36 1","pages":"238-244"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77428284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}