{"title":"SLIDER: Smart Late Injection DEflection Router for mesh NoCs","authors":"Bhawna Nayak, John Jose, M. Mutyam","doi":"10.1109/ICCD.2013.6657068","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657068","url":null,"abstract":"Network-on-Chip (NoC) provides a scalable communication interface for processing cores in large multicore systems. An efficient NoC router should not only minimize the average packet latency of the network but also have minimum pipeline latency, area, and power. Area and power overheads are affecting the scalability and popularity of traditional input buffered routers. In this context minimally buffered deflection routers are emerging as a cost effective alternative. We propose SLIDER, Smart Late Injection DEflection Router, that uses side buffers for accommodating a fraction of deflected flits. The main contributions of this work are smart late injection and selective flit preemption. In SLIDER the injection stage is kept at the end of the router pipeline. This reduces the contention in the arbitration stage, eliminates unwanted intra-router movement of flits and effectively utilizes the idle output channels. We parallelize independent operations in the router pipeline and reduce the pipeline latency by 25%. 
Experimental results on synthetic and real workloads show that SLIDER reduces average flit latency, channel wastage, and deflection rate, and increases throughput in the network when compared to the state-of-the-art minimally buffered deflection routers.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116241205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scattered superpage: A case for bridging the gap between superpage and page coloring","authors":"Licheng Chen, Yanan Wang, Zehan Cui, Yongbing Huang, Yungang Bao, Mingyu Chen","doi":"10.1109/ICCD.2013.6657040","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657040","url":null,"abstract":"Superpage and page coloring are two important practical techniques to improve the performance of Translation Lookaside Buffers (TLBs) and shared Last Level Cache (LLC) respectively. However, there exists a gap between these two techniques in current hardware-architecture design, resulting in the contradiction in adopting these two optimizations simultaneously: a superpage requires hundreds of contiguous (e.g. a power of two) base pages in both virtual and physical memory, which would compulsorily occupy all available page colors (or cache sets), thus making page coloring failed to work. This is because most contemporary architecture adopts the design with cache set indexes placed in the least significant part of block address. In this paper, we propose a lightweight approach named Scattered Superpage to bridge this gap. Scattered Superpage decouples a superpage from the limitation of occupying multiple contiguous physical base pages. A superpage is still contiguous in virtual memory, but it is scattered mapping into multiple physical superpages, and it just occupies specified partial page colors in each physical superpage, thus it allows us to configure page color for each superpage. The huge TLB is slightly modified to store page color configuration for each superpage and to calculate target physical address based on this configuration when doing address translation. The experimental results show that the Scattered Superpage can improve system performance by 20.51% and reduce unfairness by 27.77% in our 4-core simulation system (with multi-program memory-intensive workloads). 
It achieves this by reducing last level cache miss by 17.05% and reducing TLB miss by 86.02% simultaneously.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134375383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Statistical analysis and modeling for error composition in approximate computation circuits","authors":"W. Chan, A. Kahng, Seokhyeong Kang, Rakesh Kumar, J. Sartori","doi":"10.1109/ICCD.2013.6657024","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657024","url":null,"abstract":"Aggressive requirements for low power and high performance in VLSI designs have led to increased interest in approximate computation. Approximate hardware modules can achieve improved energy efficiency compared to accurate hardware modules. While a number of previous works have proposed hardware modules for approximate arithmetic, these works focus on solitary approximate arithmetic operations. To utilize the benefit of approximate hardware modules, CAD tools should be able to quickly and accurately estimate the output quality of composed approximate designs. A previous work [10] proposes an interval-based approach for evaluating the output quality of certain approximate arithmetic designs. However, their approach uses sampled error distributions to store the characterization data of hardware, and its accuracy is limited by the number of intervals used during characterization. In this work, we propose an approach for output quality estimation of approximate designs that is based on a lookup table technique that characterizes the statistical properties of approximate hardwares and a regression-based technique for composing statistics to formulate output quality. These two techniques improve the speed and accuracy for several error metrics over a set of multiply-accumulator testcases. Compared to the interval-based modeling approach of [10], our approach for estimating output quality of approximate designs is 3.75× more accurate for comparable runtime on the testcases and achieves 8.4× runtime reduction for the error composition flow. 
We also demonstrate that our approach is applicable to general testcases.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116418029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-current probabilistic writes for power-efficient STT-RAM caches","authors":"Nikolaos Strikos, Vasileios Kontorinis, Xiangyu Dong, H. Homayoun, D. Tullsen","doi":"10.1109/ICCD.2013.6657095","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657095","url":null,"abstract":"MRAM has emerged as one of the most attractive non-volatile solutions due to fast read access, low leakage power, high bit density, and long endurance. However, the high power consumption of write operations remains a barrier to the commercial adoption of MRAM technology. This paper addresses this problem by introducing low-current probabilistic writes (LCPW), a technique that reduces write access energy by lowering the amplitude of the write current pulse. Although low current pulses no longer guarantee successful bit write operations, we propose and evaluate a simple technique to ensure correctness and achieve significant power reduction over a typical MRAM implementation.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130498273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Watts-inside: A hardware-software cooperative approach for Multicore Power Debugging","authors":"Jie Chen, Fan Yao, Guru Venkataramani","doi":"10.1109/ICCD.2013.6657062","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657062","url":null,"abstract":"Multicore computing presents unique challenges for performance and power optimizations due to the multiplicity of cores and the complexity of interactions between the hardware resources. Understanding multicore power and its implications on application behavior is critical to the future of multicore software development. In this paper, we propose Watts-inside, a hardware-software cooperative framework that relies on the efficiency of hardware support to accurately gather application power profiles, and utilizes software support and causation principles for a more comprehensive understanding of application power. We show the design of our framework, along with certain optimizations that increase the ease of implementation. We present a case study using two real applications, Ocean (Splash-2) and Streamcluster (Parsec-1.0) where, with the help of feedback from Watts-inside framework, we made simple code modifications and realized up to 5% power savings on chip power consumption.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"2015 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121367757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Chisel-Q: Designing quantum circuits with a scala embedded language","authors":"Xiao Liu, J. Kubiatowicz","doi":"10.1109/ICCD.2013.6657075","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657075","url":null,"abstract":"We introduce Chisel-Q, a high-level functional language for generating quantum circuits. Chisel-Q permits quantum computing algorithms to be constructed using the meta-language features of Scala and its embedded DSL Chisel. With Chisel-Q, designers of quantum computing algorithms gain access to high-level, modern language features and abstractions. We describe a synthesis flow that transforms Chisel-Q into an explicit quantum circuit in the Quantum Assembly Language (QASM) format. We also discuss several optimizations to reduce the generated hardware cost. The Chisel-Q tool includes resource and performance estimation which can be used to compare different implementations of the same functionality. We compare the output of the generic Chisel-Q synthesis flow with hand-tuned versions of well-known quantum circuits.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123612334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On dynamic polymorphing of a superscalar core for improving energy efficiency","authors":"S. Srinivasan, Rance Rodrigues, A. Annamalai, I. Koren, S. Kundu","doi":"10.1109/ICCD.2013.6657091","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657091","url":null,"abstract":"The computational needs of a program change over time. Sometimes a program exhibits low instruction level parallelism (ILP), while at other times the inherent ILP may be higher; sometimes a program stalls due to a large number of cache misses, while at other times it may exhibit high cache throughput. Asymmetric Multicore Processors (AMP) have been proposed to allow matching the computing needs of a thread to a core where it executes most efficiently. Some of the recent works focus on AMPs consisting of a monolithic large out-of-order (OOO) core and a small in-order (InO) core. Dynamic swapping of threads between these cores is then facilitated to improve energy efficiency of the threads without impacting performance too negatively. Swapping decisions are made at coarse grain instruction granularities to mitigate the impact of migration overhead. This excludes many opportunities for swap at a fine granular level. In this paper we consider a single superscalar OOO core that can morph itself dynamically into an InO core at runtime. In order to determine when to morph from OOO to InO and vice-versa, we rely on certain hardware performance monitors. Using these performance monitors we estimate the energy-delay-squared product (ED2P) for both modes of operation, which is then used to make morphing decisions. The morphing hardware support is simple and is already available in certain Intel processors to facilitate debug. 
The proposed scheme has low migration overhead, that enables fine-grain morphing to achieve more energy efficient computing by trading a small loss of performance for much greater energy reduction.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128378316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Managing test coverage uncertainty due to thermal noise in nano-CMOS: A case-study on an SRAM array","authors":"Vikram B. Suresh, S. Kundu","doi":"10.1109/ICCD.2013.6657043","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657043","url":null,"abstract":"From system-on-a-chip to high performance processors, SRAM is a critical component. In highly scaled CMOS devices, process variation is a major concern as it affects SRAM stability which often sets the floor on supply voltage and the ceiling on operating temperature of a semiconductor chip. Consequently, low-voltage and high temperature testing are often part of manufacturing test flow. In this paper, we show that for marginal cells, thermal noise is a major corrupting factor that affects the outcome of testing. A cell with large process variation which should ordinarily fail during memory test may pass due to impact of thermal noise at high temperature. To address this uncertainty during testing, we propose a stochastic metric for test coverage. We also propose application of N-detect and Multi-level Word Line (WL) techniques to improve test coverage based on this stochastic metric. Simulation studies on 32nm PTM models indicate varying probability of faulty bit detection across the spectrum of random thermal noise that lead to erroneous test results. Multiple accesses to each bit cell during test increases the fault coverage from ~10% to near ideal 100%. Boosting WL voltage during read test and scaling it below nominal voltage during write test accelerates fault detection. 
Simulation of a 1KB SRAM array test case shows an improvement in fault coverage from ~88% to 100% by increasing the number of detects to 100.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128433647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing the costs and benefits of hardware parallelism in accelerator cores","authors":"Steven J. Battle, Mark Hempstead","doi":"10.1109/ICCD.2013.6657021","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657021","url":null,"abstract":"Power and utilization constraints are limiting the performance gains of traditional architectures. Designers are increasingly embracing specialization to improve performance in the era of dark-silicon. General purpose processors are beginning to resemble SOC's from the embedded domain, and now include many specialized accelerator cores to improve computation-throughput while reducing the energy-cost of computation. The design-space of accelerator cores is wide and varied. Designers are able to specify how much parallelism to expose in hardware by varying input width, pipeline depth, number of compute-lanes, etc. In this paper we study three accelerator cores: DES, FFT, and Jacobi Transform, exhibiting three different types of computation: streaming cryptographic, butterfly DSP, and stencil. We investigate methods to increase parallelism within the accelerator while remaining on the pareto-frontier, and examine the trade-offs faced by designers with respect to area, power, and throughput. We present models of these trade-offs and provide insight into the design of cores under real-world constraints.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131919386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Free ECC: An efficient error protection for compressed last-level caches","authors":"Long Chen, Yanan Cao, Zhao Zhang","doi":"10.1109/ICCD.2013.6657054","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657054","url":null,"abstract":"Cache reliability is increasingly a concern as cache cell dimension shrinks and cache capacity grows. Conventionally, an extra, dedicated storage is appended to cache to store error correcting code. Recently, cache compression schemes have been proposed to increase the effective cache capacity of last-level cache (LLC), for which we found the conventional cache ECC design is inefficient. We propose Free ECC that utilizes the unused fragments in compressed cache design to store ECC. It not only reduces the chip overhead but also improves cache utilization and power efficiency. Additionally, we propose an efficient convergent cache allocation scheme to organize the compressed data blocks more effectively than existing schemes. Our evaluation using SPEC CPU2006 and PARSEC benchmarks shows that the Free ECC design improves cache capacity utilization and power efficiency significantly, with negligible overhead on overall performance. This new design makes compressed cache an increasingly viable choice for processors with requirements of high reliability.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134279202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}