Alamelu Sankaranarayanan, E. K. Ardestani, J. L. Briz, Jose Renau
{"title":"An energy efficient GPGPU memory hierarchy with tiny incoherent caches","authors":"Alamelu Sankaranarayanan, E. K. Ardestani, J. L. Briz, Jose Renau","doi":"10.1109/ISLPED.2013.6629259","DOIUrl":"https://doi.org/10.1109/ISLPED.2013.6629259","url":null,"abstract":"With progressive generations and the ever-increasing promise of computing power, GPGPUs have been quickly growing in size, and at the same time, energy consumption has become a major bottleneck for them. The first level data cache and the scratchpad memory are critical to the performance of a GPGPU, but they are extremely energy inefficient due to the large number of cores they need to serve. This problem could be mitigated by introducing a cache higher up in hierarchy that services fewer cores, but this introduces cache coherency issues that may become very significant, especially for a GPGPU with hundreds of thousands of in-flight threads. In this paper, we propose adding incoherent tinyCaches between each lane in an SM, and the first level data cache that is currently shared by all the lanes in an SM. In a normal multiprocessor, this would require hardware cache coherence between all the SM lanes capable of handling hundreds of thousands of threads. Our incoherent tinyCache architecture exploits certain unique features of the CUDA/OpenCL programming model to avoid complex coherence schemes. This tinyCache is able to filter out 62% of memory requests that would otherwise need to be serviced by the DL1G, and almost 81% of scratchpad memory requests, allowing us to achieve a 37% energy reduction in the on-chip memory hierarchy. We evaluate the tinyCache for different memory patterns and show that it is beneficial in most cases.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79553275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power mapping and modeling of multi-core processors","authors":"K. Dev, Abdullah Nazma Nowroz, S. Reda","doi":"10.5555/2648668.2648680","DOIUrl":"https://doi.org/10.5555/2648668.2648680","url":null,"abstract":"We propose new techniques for post-silicon power mapping and modeling of multi-core processors using infrared imaging and performance counter measurements. An accurate finite-element modeling framework is used to capture the relationship between temperature and power, while compensating for the artifacts introduced from substituting traditional heat removal mechanisms with oil-based infrared-transparent cooling mechanisms. We use thermal conditioning techniques to build leakage power models for the die. Utilizing the power maps identified from infrared mapping, we develop empirical power models for different processor blocks based on the measurements from the performance monitoring counters (PMCs), and utilize the PMC-based models to analyze the transient power consumption. In our experiments, we capture thermal images from a quad-core processor under different workload conditions, and then we reconstruct the dynamic and leakage power maps for different blocks. Our results show good accuracy in mapping and modeling, revealing good insights into the trends of power consumption in multi-core processors.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73933699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An analytical solution for multi-core energy calculation with consideration of leakage and temperature dependency","authors":"Ming Fan, Vivek Chaturvedi, Shi Sha, Gang Quan","doi":"10.1109/ISLPED.2013.6629322","DOIUrl":"https://doi.org/10.1109/ISLPED.2013.6629322","url":null,"abstract":"Energy minimization is a critical issue and challenge when considering the cyclic dependency of leakage power and temperature as IC technology reaches deep sub-micron level. In this paper, we present an analytical method to calculate the energy consumption efficiently and effectively for a given voltage schedule on a multi-core platform, with the leakage/temperature dependency taken into consideration. Our experiments show that the proposed method can achieve a speedup of 15 times compared with the numerical method, with a relative error of no more than 1.5%.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75772833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vignyan Reddy Kothinti Naresh, S. Gilani, Erika Gunadi, N. Kim, M. Schulte, Mikko H. Lipasti
{"title":"REEL: Reducing effective execution latency of floating point operations","authors":"Vignyan Reddy Kothinti Naresh, S. Gilani, Erika Gunadi, N. Kim, M. Schulte, Mikko H. Lipasti","doi":"10.1109/ISLPED.2013.6629292","DOIUrl":"https://doi.org/10.1109/ISLPED.2013.6629292","url":null,"abstract":"The height of the dynamic dependence graph of a program, as executed by a processor, determines the minimum bound on the execution time. This height can be decreased by reducing the effective execution latency of operations that form dependence chains in the graph. In this paper, we propose a technique called REEL to reduce overall latency of chains of dependent floating point (FP) operations by increasing the throughput of computation. REEL comprises of a high-throughput floating point unit (HFP) that allows early issue of an FP Add that is dependent on another FP Add or FP Multiply. This is complemented by instruction scheduler modifications that allow early issue of dependent FP Adds, and a novel checker logic that corrects any precision errors. Unlike conventional static operation fusion, like fused Multiply-Add (FMA), there are no changes to the instruction set to enable utilization of the new hardware, and no recompilation is necessary. Furthermore, unlike ISA-level FMA, our technique produces results that are bit compatible while boosting performance of Add-Add dependence pairs in addition to Multiply-Add pairs. Our evaluation of REEL using CFP2006 benchmarks shows an average performance gain of 7.6% and maximum performance gain of 17% while consuming 1.2% lower energy.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72718154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coordinated refresh: Energy efficient techniques for DRAM refresh scheduling","authors":"Ishwar Bhati, Zeshan A. Chishti, B. Jacob","doi":"10.1109/islped.2013.6629295","DOIUrl":"https://doi.org/10.1109/islped.2013.6629295","url":null,"abstract":"As the size and speed of DRAM devices increase, the performance and energy overheads due to refresh become more significant. To reduce refresh penalty we propose techniques referred collectively as “Coordinated Refresh”, in which scheduling of low power modes and refresh commands are coordinated so that most of the required refreshes are issued when the DRAM device is in the deepest low power Self Refresh (SR) mode. Our approach saves DRAM background power because the peripheral circuitry and clocks are turned off in the SR mode. Our proposed solutions improve DRAM energy efficiency by 10% as compared to baseline, averaged across all the SPEC CPU 2006 benchmarks.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84537169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Write intensity prediction for energy-efficient non-volatile caches","authors":"Junwhan Ahn, S. Yoo, Kiyoung Choi","doi":"10.1109/ISLPED.2013.6629298","DOIUrl":"https://doi.org/10.1109/ISLPED.2013.6629298","url":null,"abstract":"This paper presents a novel concept called write intensity prediction for energy-efficient non-volatile caches as well as the architecture that implements the concept. The key idea is to correlate write intensity of cache blocks with addresses of memory access instructions that incur cache misses of those blocks. The predictor keeps track of instructions that tend to load write-intensive blocks and utilizes that information to predict write intensity of blocks. Based on this concept, we propose a block placement strategy driven by write intensity prediction for SRAM/STT-RAM hybrid caches. Experimental results show that the proposed approach reduces write energy consumption by 55% on average compared to the existing hybrid cache architecture.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90362550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Drake, M. Floyd, Richard L. Willaman, Derek J. Hathaway, J. Hernández, Crystal Soja, Marshall D. Tiner, G. Carpenter, R. Senger
{"title":"Single-cycle, pulse-shaped critical path monitor in the POWER7+ microprocessor","authors":"A. Drake, M. Floyd, Richard L. Willaman, Derek J. Hathaway, J. Hernández, Crystal Soja, Marshall D. Tiner, G. Carpenter, R. Senger","doi":"10.1109/ISLPED.2013.6629293","DOIUrl":"https://doi.org/10.1109/ISLPED.2013.6629293","url":null,"abstract":"A 32nm SOI critical path monitor (CPM) that can provide timing measurements to a Digital PLL for dynamic frequency adjustments in the 8-core POWER7+™ microprocessor is described. The CPM calibrates to within 2% of cycle time from nominal to turbo voltages. Its voltage sensitivity is 10mV/bit. It tracks processor temperature sensitivity to within 1.5% of nominal frequency, and has a sample jitter less than 1.5% of nominal frequency. The ability to detect noise dynamically allows the system to operate the processor closer to its optimal frequency for any given voltage, resulting in lower voltage for power savings or higher frequency for performance improvements.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90511609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Turnquist, Jani Mäkipää, M. Hiienkari, Hanh-Phuc Le, L. Koskinen
{"title":"Rethinking DC-DC converter design constraints for adaptable systems that target the minimum-energy point","authors":"M. Turnquist, Jani Mäkipää, M. Hiienkari, Hanh-Phuc Le, L. Koskinen","doi":"10.1109/ISLPED.2013.6629327","DOIUrl":"https://doi.org/10.1109/ISLPED.2013.6629327","url":null,"abstract":"This paper explores a new DC-DC converter design constraint for adaptable systems that target the minimum-energy point (MEP). Traditionally, DC-DC converters have regulated to a fixed output voltage over a wide range of input voltages. For energy-constrained systems that target the MEP, regulating them to a fixed voltage is unnecessary since changes in the output voltage near the MEP have little impact on the energy per cycle. This paper applies a new and traditional design constraint to a 3:1 series-parallel switched-capacitor (SC) DC-DC converter in 28 nm CMOS. The new design constraint allows for decreased design time, less area, and less system-level energy per cycle compared to traditional constraints.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86781943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robustness-driven energy-efficient ultra-low voltage standard cell design with intra-cell mixed-Vt methodology","authors":"Wenfeng Zhao, Yajun Ha, Chin Hau Hoo, A. Alvarez","doi":"10.1109/ISLPED.2013.6629317","DOIUrl":"https://doi.org/10.1109/ISLPED.2013.6629317","url":null,"abstract":"High functional yield is one of the key challenges for subthreshold standard cell designs. Device upsizing is a commonly used but suboptimal method due to its overheads in energy and area. In this paper, we propose a robustness-driven intra-cell mixed-Vt design methodology (MVT-ULV) for the robust ultra-low voltage operation. It uses low threshold voltage transistors in the weak pulling network of logic gates to enhance the robustness. It guarantees the high functional yield with the minimum energy/area overheads. We demonstrate on a commercial 65nm CMOS process that, our proposed design methodology shows up to 60mV and 110mV robustness improvement at 300mV power supply voltage over the commercial library cells and the cells built with previous Leakage-Minimization mixed-Vt methods (MVT-LM) under the same cell area constraints, respectively. In addition, the proposed MVT-ULV library enables ITC'99 benchmark circuits to show on average 30.1% and 78.1% energy-efficiency improvement when compared to the libraries built with the device-upsizing methods and the previous MVT-LM methods under the same yield constraints, respectively.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81655411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kyungtae Han, Alexander W. Min, Nithyananda S. Jeganathan, Paul Diefenbaugh
{"title":"A hybrid display frame buffer architecture for energy efficient display subsystems","authors":"Kyungtae Han, Alexander W. Min, Nithyananda S. Jeganathan, Paul Diefenbaugh","doi":"10.1109/ISLPED.2013.6629321","DOIUrl":"https://doi.org/10.1109/ISLPED.2013.6629321","url":null,"abstract":"Our principal motivation is to reduce the energy consumption of display subsystems in mobile devices by introducing a hybrid frame buffer architecture into the platform. We observed that display contents on a screen are quite static for certain mobile workloads, such as web browsing. As a result, data reading from the display frame is much more frequent than the writing of new data onto the frame buffer, a state we refer to as read dominance. Based on this observation, we propose a hybrid frame buffer architecture that exploits the display contents' read-dominant property to improve the energy efficiency of display subsystems. Specifically, we employ two memory types: DRAM and Phase-Change Memory (PCM), in the display frame buffer to exploit their different read/write energy characteristics. We also present an analysis of the energy efficiency of the hybrid frame buffer based on our display content and energy consumption models. Our evaluation results show that the proposed hybrid frame buffer reduces frame buffer energy consumption by up to 43%, compared to the conventional DRAM-only frame buffer.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85377475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}