Title: A case for toggle-aware compression for GPU systems
Authors: Gennady Pekhimenko, Evgeny Bolotin, Nandita Vijaykumar, O. Mutlu, T. Mowry, S. Keckler
Venue: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/HPCA.2016.7446064
Abstract: Data compression can be an effective method to achieve higher system performance and energy efficiency in modern data-intensive applications by exploiting redundancy and data similarity. Prior works have studied a variety of data compression techniques to improve both capacity (e.g., of caches and main memory) and bandwidth utilization (e.g., of the on-chip and off-chip interconnects). In this paper, we make a new observation about the energy efficiency of communication when compression is applied. While compression reduces the amount of transferred data, it leads to a substantial increase in the number of bit toggles (i.e., communication channel switchings from 0 to 1 or from 1 to 0). The increased toggle count increases the dynamic energy consumed by on-chip and off-chip buses due to more frequent charging and discharging of the wires. Our results show that the total bit toggle count can increase by 20% to 2.2x when compression is applied, depending on the compression algorithm, averaged across different application suites. We characterize and demonstrate this new problem across 242 GPU applications and six different compression algorithms. To mitigate the problem, we propose two new toggle-aware compression techniques: Energy Control and Metadata Consolidation. These techniques greatly reduce the bit toggle count impact of the data compression algorithms we examine, while keeping most of their bandwidth reduction benefits.
{"title":"Atomic persistence for SCM with a non-intrusive backend controller","authors":"K. Doshi, Ellis R. Giles, P. Varman","doi":"10.1109/HPCA.2016.7446055","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446055","url":null,"abstract":"Non-volatile byte-addressable memory has the potential to revolutionize system architecture by providing instruction-grained direct access to vast amounts of persistent data. We describe a non-intrusive memory controller that uses backend operations for achieving lightweight failure atomicity. By moving synchronous persistent memory operations to the background, the performance overheads are minimized. Our solution avoids costly software intervention by decoupling isolation and concurrency-driven atomicity from failure atomicity and durability, and does not require changes to the front-end cache hierarchy. Two implementation alternatives - one using a hardware structure, and the other extending the memory controller with a firmware managed volatile space - are described. Our results show the performance is significantly better than traditional approaches.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123057329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HRL: Efficient and flexible reconfigurable logic for near-data processing","authors":"Mingyu Gao, C. Kozyrakis","doi":"10.1109/HPCA.2016.7446059","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446059","url":null,"abstract":"The energy constraints due to the end of Dennard scaling, the popularity of in-memory analytics, and the advances in 3D integration technology have led to renewed interest in near-data processing (NDP) architectures that move processing closer to main memory. Due to the limited power and area budgets of the logic layer, the NDP compute units should be area and energy efficient while providing sufficient compute capability to match the high bandwidth of vertical memory channels. They should also be flexible to accommodate a wide range of applications. Towards this goal, NDP units based on fine-grained (FPGA) and coarse-grained (CGRA) reconfigurable logic have been proposed as a compromise between the efficiency of custom engines and the flexibility of programmable cores. Unfortunately, FPGAs incur significant area overheads for bit-level reconfiguration, while CGRAs consume significant power in the interconnect and are inefficient for irregular data layouts and control flows. This paper presents Heterogeneous Reconfigurable Logic (HRL), a reconfigurable array for NDP systems that improves on both FPGA and CGRA arrays. HRL combines both coarse-grained and fine-grained logic blocks, separates routing networks for data and control signals, and uses specialized units to effectively support branch operations and irregular data layouts in analytics workloads. HRL has the power efficiency of FPGA and the area efficiency of CGRA. It improves performance per Watt by 2.2x over FPGA and 1.7x over CGRA. For NDP systems running MapReduce, graph processing, and deep neural networks, HRL achieves 92% of the peak performance of an NDP system based on custom accelerators for each application.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128124247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Warped-preexecution: A GPU pre-execution approach for improving latency hiding
Authors: Keunsoo Kim, Sangpil Lee, M. Yoon, Gunjae Koo, W. Ro, M. Annavaram
Venue: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/HPCA.2016.7446062
Abstract: This paper presents a pre-execution approach for improving GPU performance, called P-mode (pre-execution mode). GPUs utilize a number of concurrent threads for hiding processing delay of operations. However, certain long-latency operations such as off-chip memory accesses often take hundreds of cycles and hence lead to stalls even in the presence of thread concurrency and fast thread switching capability. It is unclear if adding more threads can improve latency tolerance due to increased memory contention. Further, adding more threads increases on-chip storage demands. Instead, we propose that when a warp is stalled on a long-latency operation, it enters P-mode. In P-mode, a warp continues to fetch and decode successive instructions to identify any independent instruction that is not on the long-latency dependence chain. These independent instructions are then pre-executed. To tackle write-after-write and write-after-read hazards, output values produced in P-mode are written to renamed physical registers. We exploit the register file underutilization to re-purpose a few unused registers to store the P-mode results. When a warp is switched from P-mode to normal execution mode, it reuses pre-executed results by reading the renamed registers. Any global load operation in P-mode is transformed into a pre-load which fetches data into the L1 cache to reduce future memory access penalties. Our evaluation results show 23% performance improvement for memory intensive applications, without negatively impacting other application categories.

Title: ScalCore: Designing a core for voltage scalability
Authors: Bhargava Gopireddy, Choungki Song, J. Torrellas, N. Kim, Aditya Agrawal, Asit K. Mishra
Venue: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/HPCA.2016.7446104
Abstract: Upcoming multicores need to provide increasingly stringent energy-efficient execution modes. Currently, energy efficiency is attained by lowering the voltage (Vdd) through DVFS. However, the effectiveness of DVFS is limited: designing cores for low Vdd results in energy inefficiency at nominal Vdd. Our goal is to design a core for Voltage Scalability, i.e., one that can work in high-performance mode (HPMode) at nominal Vdd, and in a very energy-efficient mode (EEMode) at low Vdd. We call this core ScalCore. To operate energy-efficiently in EEMode, ScalCore introduces two ideas. First, since logic and storage structures scale differently with Vdd, ScalCore applies two low Vdds to the pipeline: one to the logic stages (Vlogic) and a higher one to storage-intensive stages. Second, ScalCore further increases the low Vdd of the storage-intensive stages (Vop), so that they are substantially faster than the logic ones. Then, it exploits the speed differential by either fusing storage-intensive pipeline stages or increasing the size of storage structures in the pipeline. Our simulations of 16 cores show that a design with ScalCores in EEMode is much more energy-efficient than one with conventional cores and aggressive DVFS: for approximately the same power, ScalCores reduce the average execution time of programs by 31%, the energy (E) consumed by 48%, and the ED product by 60%. In addition, dynamically switching between EEMode and HPMode based on program phases is very effective: it reduces the average execution time and ED product by a further 28% and 15%, respectively.

Title: Lattice priority scheduling: Low-overhead timing-channel protection for a shared memory controller
Authors: Andrew Ferraiuolo, Yao Wang, Danfeng Zhang, A. Myers, G. Suh
Venue: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/HPCA.2016.7446080
Abstract: Computer hardware is increasingly shared by distrusting parties in platforms such as commercial clouds and web servers. Though hardware sharing is critical for performance and efficiency, this sharing creates timing-channel vulnerabilities in hardware components such as memory controllers and shared memory. Past work on timing-channel protection for memory controllers assumes all parties are mutually distrusting and require timing-channel protection. This assumption limits the capability of the memory controller to allocate resources effectively, and causes severe performance penalties. Further, the assumption that all entities are mutually distrusting is often a poor fit for the security needs of real systems. Often, some entities do not require timing-channel protection or trust others with information. We propose lattice priority scheduling (LPS), a secure memory scheduling algorithm that improves performance by more precisely meeting the target system's security requirements, expressed as a lattice policy. We evaluate LPS in a simulated 8-core microprocessor. Compared to prior solutions [34], lattice priority scheduling improves system throughput by over 30% on average and by up to 84% for some workloads.