{"title":"TESLA: Using microfluidics to thermally stabilize 3D stacked STT-RAM caches","authors":"Majed Valad Beigi, G. Memik","doi":"10.1109/ICCD.2016.7753299","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753299","url":null,"abstract":"In this work, we develop a 3D architecture that utilizes STT-RAM for the last level cache (LLC). 3D integration enables large LLCs to be stacked onto a die. However, 3D architectures suffer from higher operating temperatures due to increased power densities. The elevated temperatures can adversely impact the STT-RAM performance and reliability. The objective of this paper is to address the limits of integrating STT-RAM in 3D chip stacks from a thermal perspective and propose a novel stacking structure that minimizes heat-induced problems. Specifically, we analyze the system-level impact of increased temperatures and propose a novel technique to dynamically adjust the flow rate of the liquid interlayer cooling at run time to reduce the STT-RAM temperature and alleviate temperature-induced problems that cause the performance degradation and prevent overcooling the STT-RAM die and minimize the pump energy consumption. Evaluation results reveal that our approach achieves up to 19.1% performance improvement and 14.6% power reduction over an architecture that does not include an insulating layer.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130196928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel approximate synthesis flow for energy-efficient FIR filter","authors":"Yesung Kang, Jaewoo Kim, Seokhyeong Kang","doi":"10.1109/ICCD.2016.7753266","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753266","url":null,"abstract":"The portability of emerging computing systems demands further reduction in the power consumption of their components. Approximate computing can reduce power consumption by using a simplified or an inaccurate circuit. In this paper, the energy efficiency of a finite impulse response (FIR) filter is improved through approximate computing. We propose an approximate synthesis technique for an energy-efficient FIR filter with an acceptable level of accuracy. We employ the common subexpression elimination (CSE) algorithm to implement the FIR filter and replace conventional adder/subtractors with approximate ones. While yielding acceptable rates of accuracy, the proposed flow can attain a maximum energy saving of 50.7% in comparison with conventional FIR filter designs.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"302 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127428634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fluid Pipelines: Elastic circuitry meets Out-of-Order execution","authors":"R. T. Possignolo, E. Ebrahimi, H. Skinner, Jose Renau","doi":"10.1109/ICCD.2016.7753285","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753285","url":null,"abstract":"Pipeline depth and cycle time are fixed early in the chip design process but their impact can only be assessed when the implementation is mostly done and changing them is impractical. Elastic Systems are latency insensitive systems, and allow changes in the pipeline depth late in the design process with little design effort. Nevertheless, they have significant throughput penalty when new stages are added in the presence of pipeline loops. We propose Fluid Pipelines, an evolution that allows pipeline transformations without a throughput penalty. Formally, we introduce “or-causality” in addition to the already existing “and-causality” in Elastic Systems. It gives more flexibility than previously possible at the cost of having the designer to specify the intended behavior of the circuit. In an Out-of-Order core benchmark, Fluid Pipelines improve the optimal energy-delay point by shifting both performance (by 17%) and energy (by 13%). We envision a scenario where tools would be able to generate different pipeline configurations from the same RTL e.g., low power, high performance.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127592454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A single-inductor-cascaded-stage topology for high conversion ratio boost regulator","authors":"K. Z. Ahmed, S. Mukhopadhyay","doi":"10.1109/ICCD.2016.7753331","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753331","url":null,"abstract":"A single-inductor-cascaded-stage boost regulator topology is presented that time-multiplexes a single inductor using one-nFET-two-pFET power stage and a bias-gated Pulse-Frequency Modulation controller to achieve high conversion ratio. A test-chip in 130nm CMOS demonstrates 120× conversion using a single inductor while consuming 140nA bias current.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125943278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IACM: Integrated adaptive cache management for high-performance and energy-efficient GPGPU computing","authors":"Kyu Yeun Kim, Jinsu Park, Woongki Baek","doi":"10.1109/ICCD.2016.7753308","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753308","url":null,"abstract":"Hardware caches are widely employed in GPGPUs to achieve higher performance and energy efficiency. Incorporating hardware caches in GPGPUs, however, does not immediately guarantee enhanced performance and energy efficiency due to high cache contention and thrashing. To address the inefficiency of GPGPU caches, various adaptive techniques (e.g., warp limiting) have been proposed. However, relatively little work has been done in the context of creating an architectural framework that tightly integrates adaptive cache management techniques and investigating their effectiveness and interaction. To bridge this gap, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. IACM integrates the state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (i.e., 98.1% and 61.9% on average).","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126183148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stochastic neuromorphic learning machines for weakly labeled data","authors":"E. Neftci","doi":"10.1109/ICCD.2016.7753355","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753355","url":null,"abstract":"At learning tasks where humans typically outperform computers, neuromorphic learning machines can have potential advantages in learning in terms of power and complexity compared to mainstream technologies. Here, we present Synaptic Sampling Machines (S2M), a class of stochastic neural networks that use stochasticity at the connections (synapses) to implement energy efficient semi- and unsupervised learning for weakly or unlabeled data. Stochastic synapses play the dual role of a regularizer during learning and a mechanism for implementing stochasticity in neural networks. We show a S2M network architecture that is well suited for a dedicated digital implementation, that is potentially hundredfold more energy efficient compared to equivalent algorithms operating on GPUs.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123724452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A strong arbiter PUF using resistive RAM within 1T-1R memory architecture","authors":"Rekha Govindaraj, Swaroop Ghosh","doi":"10.1109/ICCD.2016.7753272","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753272","url":null,"abstract":"Physically Unclonable Function (PUF) is cost effective and reliable security primitives widely used in authentication and in-place secret key generation. With growing research in the area of non-CMOS technologies for memories and circuits, it is important to understand their implications on the design of security primitives. Resistive Random Accessible Memory (RRAM) offers easy integration with CMOS due to minimal changes in the process technology. RRAM also demonstrates resistance variability characteristics due to inherent defects in the conducting filament formed inside the metal oxide layer. RRAM based PUF designs exploit either the probabilistic switching of RRAM or the resistance variability during forming, SET and RESET processes. Memory PUFs using RRAM are typically weak PUFs due to fewer number of Challenge Response Pairs (CRPs). We propose strong arbiter PUF based on 1T-1R bit cell which is obtained from conventional RRAM memory array with minimally invasive changes. Conventional voltage sense amplifier is employed to generate the response. The PUF is simulated using 65nm predictive technology models for CMOS and Verilog-A model for a hafnium oxide based RRAM. The proposed PUF architecture is evaluated for uniqueness, uniformity and reliability and by running NIST benchmarks. It demonstrates mean intra-die Hamming Distance (HD) of 0.13% and inter-die HD of 51.3%, and, passes the NIST tests.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130457940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lumos+: Rapid, pre-RTL design space exploration on accelerator-rich heterogeneous architectures with reconfigurable logic","authors":"Liang Wang, K. Skadron","doi":"10.1109/ICCD.2016.7753297","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753297","url":null,"abstract":"We propose Lumos+, an analytical framework for power and performance modeling of accelerator-rich heterogeneous architectures. As accelerators proliferate, the search space becomes too expensive for brute-force search. We describe a novel and highly accurate genetic search algorithm. We then use Lumos+ to explore the tradeoffs between using fixed-function accelerators and reconfigurable logic blocks, while accounting for diverse workload characteristics, hardware overheads, and system constraints, and show that reconfigurable logic can improve power and performance while improving overall system flexibility and the ability to adapt to diverse and changing workloads.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134061714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"nOS: A nano-sized distributed operating system for many-core embedded systems","authors":"S. Hollis, Edward Ma, R. Marculescu","doi":"10.1109/ICCD.2016.7753278","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753278","url":null,"abstract":"We introduce nOS, a “nano-sized” fully distributed operating system aimed at large-scale, many-core embedded systems. nOS enables dynamic runtime optimisation of energy and execution time through lightweight and scalable distributed protocols. nOS implements new dynamic resource optimisation algorithms, and provides an intuitive and easy-to-use programmer API that supports runtime task energy optimisation through dynamic frequency scaling, transparent task communication tracking, and automatic task mapping. Critically, nOS has a completely distributed implementation, providing excellent scalability. Contrary to other approaches, the dynamic runtime optimisations require no a priori knowledge of workload or communication patterns. By generating runtime measurements of thread performance, core load, and process communication, we show that nOS can deliver improvements that would not be possible with only static analysis. Using a many-core system called Swallow, we show a <3kB fullstack implementation of nOS together with application, OS and hardware. Using two applications with different communication patterns, we illustrate the power and flexibility of our approach, as well as various tradeoffs in energy and performance from making better mapping choices than would be available offline.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133670729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A statistical critical path monitor in 14nm CMOS","authors":"B. Fleischer, Christos Vezyrtzis, K. Balakrishnan, K. Jenkins","doi":"10.1109/ICCD.2016.7753334","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753334","url":null,"abstract":"Local variation of delay paths has a significant impact on modern microprocessor performance and yield. A critical path monitor is reported which extracts timing variability information on various critical paths, including sample processor paths. The very compact circuit contains 256 copies of 15 different delay paths, enabling measurement of the statistics of delay variation, as a function of threshold voltage, supply voltage, fanout, temperature, and circuit topology. Measurements of 14nm SOI finFET [1] circuit path delays are presented. The reported sensor can offer a variety of advantages on a processor chip, ranging from testing time improvement to power savings.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132810209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}