Title: A case for toggle-aware compression for GPU systems
Authors: Gennady Pekhimenko, Evgeny Bolotin, Nandita Vijaykumar, O. Mutlu, T. Mowry, S. Keckler
Venue: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/HPCA.2016.7446064
Abstract: Data compression can be an effective method to achieve higher system performance and energy efficiency in modern data-intensive applications by exploiting redundancy and data similarity. Prior works have studied a variety of data compression techniques to improve both capacity (e.g., of caches and main memory) and bandwidth utilization (e.g., of the on-chip and off-chip interconnects). In this paper, we make a new observation about the energy efficiency of communication when compression is applied. While compression reduces the amount of transferred data, it leads to a substantial increase in the number of bit toggles (i.e., communication channel switchings from 0 to 1 or from 1 to 0). The increased toggle count increases the dynamic energy consumed by on-chip and off-chip buses due to more frequent charging and discharging of the wires. Our results show that the total bit toggle count can increase by 20% to 2.2x when compression is applied, depending on the compression algorithm, averaged across different application suites. We characterize and demonstrate this new problem across 242 GPU applications and six different compression algorithms. To mitigate the problem, we propose two new toggle-aware compression techniques: Energy Control and Metadata Consolidation. These techniques greatly reduce the bit toggle count impact of the data compression algorithms we examine, while keeping most of their bandwidth reduction benefits.
{"title":"Atomic persistence for SCM with a non-intrusive backend controller","authors":"K. Doshi, Ellis R. Giles, P. Varman","doi":"10.1109/HPCA.2016.7446055","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446055","url":null,"abstract":"Non-volatile byte-addressable memory has the potential to revolutionize system architecture by providing instruction-grained direct access to vast amounts of persistent data. We describe a non-intrusive memory controller that uses backend operations for achieving lightweight failure atomicity. By moving synchronous persistent memory operations to the background, the performance overheads are minimized. Our solution avoids costly software intervention by decoupling isolation and concurrency-driven atomicity from failure atomicity and durability, and does not require changes to the front-end cache hierarchy. Two implementation alternatives - one using a hardware structure, and the other extending the memory controller with a firmware managed volatile space - are described. Our results show the performance is significantly better than traditional approaches.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123057329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HRL: Efficient and flexible reconfigurable logic for near-data processing","authors":"Mingyu Gao, C. Kozyrakis","doi":"10.1109/HPCA.2016.7446059","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446059","url":null,"abstract":"The energy constraints due to the end of Dennard scaling, the popularity of in-memory analytics, and the advances in 3D integration technology have led to renewed interest in near-data processing (NDP) architectures that move processing closer to main memory. Due to the limited power and area budgets of the logic layer, the NDP compute units should be area and energy efficient while providing sufficient compute capability to match the high bandwidth of vertical memory channels. They should also be flexible to accommodate a wide range of applications. Towards this goal, NDP units based on fine-grained (FPGA) and coarse-grained (CGRA) reconfigurable logic have been proposed as a compromise between the efficiency of custom engines and the flexibility of programmable cores. Unfortunately, FPGAs incur significant area overheads for bit-level reconfiguration, while CGRAs consume significant power in the interconnect and are inefficient for irregular data layouts and control flows. This paper presents Heterogeneous Reconfigurable Logic (HRL), a reconfigurable array for NDP systems that improves on both FPGA and CGRA arrays. HRL combines both coarse-grained and fine-grained logic blocks, separates routing networks for data and control signals, and uses specialized units to effectively support branch operations and irregular data layouts in analytics workloads. HRL has the power efficiency of FPGA and the area efficiency of CGRA. It improves performance per Watt by 2.2x over FPGA and 1.7x over CGRA. For NDP systems running MapReduce, graph processing, and deep neural networks, HRL achieves 92% of the peak performance of an NDP system based on custom accelerators for each application.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128124247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Warped-preexecution: A GPU pre-execution approach for improving latency hiding
Authors: Keunsoo Kim, Sangpil Lee, M. Yoon, Gunjae Koo, W. Ro, M. Annavaram
Venue: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/HPCA.2016.7446062
Abstract: This paper presents a pre-execution approach for improving GPU performance, called P-mode (pre-execution mode). GPUs utilize a number of concurrent threads for hiding processing delay of operations. However, certain long-latency operations such as off-chip memory accesses often take hundreds of cycles and hence lead to stalls even in the presence of thread concurrency and fast thread switching capability. It is unclear if adding more threads can improve latency tolerance due to increased memory contention. Further, adding more threads increases on-chip storage demands. Instead, we propose that when a warp is stalled on a long-latency operation, it enters P-mode. In P-mode, a warp continues to fetch and decode successive instructions to identify any independent instruction that is not on the long-latency dependence chain. These independent instructions are then pre-executed. To tackle write-after-write and write-after-read hazards, output values produced in P-mode are written to renamed physical registers. We exploit the register file underutilization to re-purpose a few unused registers to store the P-mode results. When a warp is switched from P-mode to normal execution mode, it reuses pre-executed results by reading the renamed registers. Any global load operation in P-mode is transformed into a pre-load which fetches data into the L1 cache to reduce future memory access penalties. Our evaluation results show 23% performance improvement for memory intensive applications, without negatively impacting other application categories.

Title: ScalCore: Designing a core for voltage scalability
Authors: Bhargava Gopireddy, Choungki Song, J. Torrellas, N. Kim, Aditya Agrawal, Asit K. Mishra
Venue: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/HPCA.2016.7446104
Abstract: Upcoming multicores need to provide increasingly stringent energy-efficient execution modes. Currently, energy efficiency is attained by lowering the voltage (Vdd) through DVFS. However, the effectiveness of DVFS is limited: designing cores for low Vdd results in energy inefficiency at nominal Vdd. Our goal is to design a core for Voltage Scalability, i.e., one that can work in high-performance mode (HPMode) at nominal Vdd, and in a very energy-efficient mode (EEMode) at low Vdd. We call this core ScalCore. To operate energy-efficiently in EEMode, ScalCore introduces two ideas. First, since logic and storage structures scale differently with Vdd, ScalCore applies two low Vdds to the pipeline: one to the logic stages (Vlogic) and a higher one to storage-intensive stages. Second, ScalCore further increases the low Vdd of the storage-intensive stages (Vop), so that they are substantially faster than the logic ones. Then, it exploits the speed differential by either fusing storage-intensive pipeline stages or increasing the size of storage structures in the pipeline. Our simulations of 16 cores show that a design with ScalCores in EEMode is much more energy-efficient than one with conventional cores and aggressive DVFS: for approximately the same power, ScalCores reduce the average execution time of programs by 31%, the energy (E) consumed by 48%, and the ED product by 60%. In addition, dynamically switching between EEMode and HPMode based on program phases is very effective: it reduces the average execution time and ED product by a further 28% and 15%, respectively.

Title: Lattice priority scheduling: Low-overhead timing-channel protection for a shared memory controller
Authors: Andrew Ferraiuolo, Yao Wang, Danfeng Zhang, A. Myers, G. Suh
Venue: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/HPCA.2016.7446080
Abstract: Computer hardware is increasingly shared by distrusting parties in platforms such as commercial clouds and web servers. Though hardware sharing is critical for performance and efficiency, this sharing creates timing-channel vulnerabilities in hardware components such as memory controllers and shared memory. Past work on timing-channel protection for memory controllers assumes all parties are mutually distrusting and require timing-channel protection. This assumption limits the capability of the memory controller to allocate resources effectively, and causes severe performance penalties. Further, the assumption that all entities are mutually distrusting is often a poor fit for the security needs of real systems. Often, some entities do not require timing-channel protection or trust others with information. We propose lattice priority scheduling (LPS), a secure memory scheduling algorithm that improves performance by more precisely meeting the target system's security requirements, expressed as a lattice policy. We evaluate LPS in a simulated 8-core microprocessor. Compared to prior solutions [34], lattice priority scheduling improves system throughput by over 30% on average and by up to 84% for some workloads.