Latest publications: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)

Domain knowledge based energy management in handhelds
N. Nachiappan, Praveen Yedlapalli, N. Soundararajan, A. Sivasubramaniam, M. Kandemir, Ravishankar R. Iyer, C. Das
{"title":"Domain knowledge based energy management in handhelds","authors":"N. Nachiappan, Praveen Yedlapalli, N. Soundararajan, A. Sivasubramaniam, M. Kandemir, Ravishankar R. Iyer, C. Das","doi":"10.1109/HPCA.2015.7056029","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056029","url":null,"abstract":"Energy management in handheld devices is becoming a daunting task with the growing number of accelerators, increasing memory demands and high computing capacities required to support applications with stringent QoS needs. Current DVFS techniques that modulate power states of a single hardware component, or even recent proposals that manage multiple components, can lose out opportunities for attaining high energy efficiencies that may be possible by leveraging application domain knowledge. Thus, this paper proposes a coordinated multi-component energy optimization mechanism for handheld devices, where the energy profile of different components such as CPU, memory, GPU and IP cores are considered in unison to trigger the appropriate DVFS state by exploiting the application domain knowledge. Specifically, we show that for the important class of frame-based applications, the domain knowledge - frame processing rates, component utilization and available slack - can be used to decide effective DVFS states for each component from among the numerous choices. With such knowledge, rather than a brute force search of all speed setting choices, we propose two simpler heuristics, called Greedy policy and Kaldor-Hicks compensation policy, to make the decisions at frame boundaries. Our evaluations with 7 commonly-used Android apps show that our domain-aware coordinated DVFS policies have 23% better energy efficiency than the conventionally used Android governors, and are within ~9% of an optimal policy that does not drop any frames.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"150-160"},"PeriodicalIF":0.0,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77240127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 37
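
The Greedy policy lends itself to a compact illustration: at each frame boundary, trial-downclock each component, keep only moves that still meet the frame deadline, and commit the one that saves the most power. The sketch below is a minimal Python rendering of that loop; the Component class, the state tables, and the serial frame-time model are invented for illustration and are not taken from the paper.

```python
# Minimal sketch of a greedy coordinated-DVFS decision at a frame boundary.
# All numbers and the frame-time model are illustrative assumptions.

FRAME_DEADLINE_MS = 1000 / 60          # a 60 FPS frame budget

class Component:
    def __init__(self, name, states, work):
        self.name = name
        self.states = states           # (freq_ghz, power_w), fastest first
        self.work = work               # arbitrary work units per frame
        self.level = 0                 # current DVFS state index

    def time_ms(self):
        return self.work / self.states[self.level][0]

    def power_w(self):
        return self.states[self.level][1]

def frame_time_ms(components):
    # Illustrative serial model: each component processes the frame in turn.
    return sum(c.time_ms() for c in components)

def greedy_dvfs(components):
    """Repeatedly downclock the component that saves the most power,
    as long as the predicted frame time still meets the deadline."""
    while True:
        best, best_saving = None, 0.0
        for c in components:
            if c.level + 1 == len(c.states):
                continue               # already at its lowest state
            old_power = c.power_w()
            c.level += 1               # tentatively downclock
            feasible = frame_time_ms(components) <= FRAME_DEADLINE_MS
            saving = old_power - c.power_w()
            c.level -= 1               # undo the trial move
            if feasible and saving > best_saving:
                best, best_saving = c, saving
        if best is None:
            return                     # no feasible downclock remains
        best.level += 1                # commit the greediest move

cpu = Component("cpu", [(2.0, 2.0), (1.5, 1.2), (1.0, 0.7)], work=8.0)
gpu = Component("gpu", [(0.8, 3.0), (0.6, 1.9), (0.4, 1.1)], work=4.0)
greedy_dvfs([cpu, gpu])
print([(c.name, c.level) for c in (cpu, gpu)])   # [('cpu', 1), ('gpu', 2)]
```
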
High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches
A. Jaleel, Joseph Nuzman, Adrian Moga, S. Steely, J. Emer
{"title":"High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches","authors":"A. Jaleel, Joseph Nuzman, Adrian Moga, S. Steely, J. Emer","doi":"10.1109/HPCA.2015.7056045","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056045","url":null,"abstract":"Increasing transistor density enables adding more on-die cache real-estate However, devoting more space to the shared last-level-cache (LLC) causes the memory latency bottleneck to move from memory access latency to shared cache access latency. As such, applications whose working set is larger than the smaller caches spend a large fraction of their execution time on shared cache access latency. To address this problem, this paper investigates increasing the size of smaller private caches in the hierarchy as opposed to increasing the shared LLC. Doing so improves average cache access latency for workloads whose working set fits into the larger private cache while retaining the benefits of a shared LLC. The consequence of increasing the size of private caches is to relax inclusion and build exclusive hierarchies. Thus, for the same total caching capacity, an exclusive cache hierarchy provides better cache access latency. We observe that server workloads benefit tremendously from an exclusive hierarchy with large private caches. This is primarily because large private caches accommodate the large code working-sets of server workloads. For a 16-core CMP, an exclusive cache hierarchy improves server workload performance by 5-12% as compared to an equal capacity inclusive cache hierarchy. The paper also presents directions for further research to maximize performance of exclusive cache hierarchies.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"2 1","pages":"343-353"},"PeriodicalIF":0.0,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74694767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 40
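
The mechanical difference the paper relies on is how lines move in an exclusive hierarchy: the L2 holds only L1 victims, and a hit in L2 promotes the line back into L1 (and out of L2), so a block occupies exactly one level and the effective capacity is L1 + L2. Below is a toy two-level model of that flow, assuming simple fully-associative LRU arrays; it sketches the general exclusive-hierarchy mechanism, not the authors' specific design.

```python
# Toy exclusive two-level hierarchy: L2 acts as a victim cache for L1.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity, self.lines = capacity, OrderedDict()

    def lookup(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)   # refresh LRU position
            return True
        return False

    def insert(self, addr):
        """Insert a block; return the evicted LRU victim, if any."""
        victim = None
        if addr not in self.lines and len(self.lines) == self.capacity:
            victim, _ = self.lines.popitem(last=False)
        self.lines[addr] = True
        return victim

def exclusive_access(l1, l2, addr):
    if l1.lookup(addr):
        return "L1 hit"
    if l2.lookup(addr):
        del l2.lines[addr]        # exclusive: promote to L1, drop from L2
        status = "L2 hit"
    else:
        status = "miss (fill from memory straight into L1)"
    victim = l1.insert(addr)
    if victim is not None:
        l2.insert(victim)         # L1 victims are the only L2 fills
    return status

l1, l2 = LRUCache(2), LRUCache(4)
for addr in (1, 2, 3, 1):
    print(addr, exclusive_access(l1, l2, addr))
# the final access to 1 hits in L2 and is promoted back into L1
```
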
Power punch: Towards non-blocking power-gating of NoC routers
Lizhong Chen, Di Zhu, Massoud Pedram, T. Pinkston
{"title":"Power punch: Towards non-blocking power-gating of NoC routers","authors":"Lizhong Chen, Di Zhu, Massoud Pedram, T. Pinkston","doi":"10.1109/HPCA.2015.7056048","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056048","url":null,"abstract":"As chip designs penetrate further into the dark silicon era, innovative techniques are much needed to power off idle or under-utilized system components while having minimal impact on performance. On-chip network routers are potentially good targets for power-gating, but packets in the network can be significantly delayed as their paths may be blocked by powered-off routers. In this paper, we propose Power Punch, a novel performance-aware, power reduction scheme that aims to achieve non-blocking power-gating of on-chip network routers. Two mechanisms are proposed that not only allow power control signals to utilize existing slack at source nodes to wake up powered-off routers along the first few hops before packets are injected, but also allow these signals to utilize hop count slack by staying ahead of packets to \"punch through \" any blocked routers along the imminent path of packets, preventing packets from having to suffer router wakeup latency or packet detour latency. Full system evaluation on PARSEC benchmarks shows Power Punch saves more than 83% of router static energy while having an execution time penalty of less than 0.4%, effectively achieving near non-blocking power-gating of on-chip network routers.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"378-389"},"PeriodicalIF":0.0,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90929391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 98
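
The timing argument behind "punching through" can be made concrete: if a wake signal is issued WAKEUP_LATENCY cycles before a packet's head flit reaches a hop, that router is on when the flit arrives. For early hops the required lead time precedes injection, so the source must spend its own slack; for later hops the packet's hop-count slack covers it. The sketch below computes that schedule under invented per-hop and wakeup latencies.

```python
ROUTER_DELAY = 3     # cycles per hop (invented)
WAKEUP_LATENCY = 8   # cycles for a gated router to power up (invented)

def wake_schedule(inject_cycle, path):
    """Latest cycle at which each router's wake signal must be issued
    so the router is awake when the packet's head flit arrives."""
    schedule = []
    for hop, router in enumerate(path):
        arrival = inject_cycle + hop * ROUTER_DELAY
        # An issue cycle before inject_cycle means the source must use
        # its pre-injection slack to wake this router in advance.
        schedule.append((router, arrival - WAKEUP_LATENCY))
    return schedule

print(wake_schedule(inject_cycle=10, path=["R0", "R1", "R2", "R3"]))
# [('R0', 2), ('R1', 5), ('R2', 8), ('R3', 11)]: the first three hops need
# source-side slack; R3's wake can chase the packet and still win the race.
```
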
Coordinated static and dynamic cache bypassing for GPUs
Xiaolong Xie, Yun Liang, Yu Wang, Guangyu Sun, Tao Wang
{"title":"Coordinated static and dynamic cache bypassing for GPUs","authors":"Xiaolong Xie, Yun Liang, Yu Wang, Guangyu Sun, Tao Wang","doi":"10.1109/HPCA.2015.7056023","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056023","url":null,"abstract":"The massive parallel architecture enables graphics processing units (GPUs) to boost performance for a wide range of applications. Initially, GPUs only employ scratchpad memory as on-chip memory. Recently, to broaden the scope of applications that can be accelerated by GPUs, GPU vendors have used caches in conjunction with scratchpad memory as on-chip memory in the new generations of GPUs. Unfortunately, GPU caches face many performance challenges that arise due to excessive thread contention for cache resource. Cache bypassing, where memory requests can selectively bypass the cache, is one solution that can help to mitigate the cache resource contention problem. In this paper, we propose coordinated static and dynamic cache bypassing to improve application performance. At compile-time, we identify the global loads that indicate strong preferences for caching or bypassing through profiling. For the rest global loads, our dynamic cache bypassing has the flexibility to cache only a fraction of threads. In CUDA programming model, the threads are divided into work units called thread blocks. Our dynamic bypassing technique modulates the ratio of thread blocks that cache or bypass at run-time. We choose to modulate at thread block level in order to avoid the memory divergence problems. Our approach combines compile-time analysis that determines the cache or bypass preferences for global loads with run-time management that adjusts the ratio of thread blocks that cache or bypass. Our coordinated static and dynamic cache bypassing technique achieves up to 2.28X (average 1.32X) performance speedup for a variety of GPU applications.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"101 1","pages":"76-88"},"PeriodicalIF":0.0,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85884768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 121
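
A minimal sketch of the two coordinated decisions, with invented thresholds: profiling buckets each global load into always-cache, always-bypass, or dynamic, and for dynamic loads the runtime modulates what fraction of thread blocks may use the cache (block granularity keeps every thread of a block on the same path, avoiding memory divergence). The helper names and the feedback rule below are assumptions, not the paper's exact mechanism.

```python
def classify_load(profile_hit_rate, lo=0.2, hi=0.8):
    """Compile-time decision from a profiling run (thresholds invented)."""
    if profile_hit_rate >= hi:
        return "cache"     # strong reuse: always cache
    if profile_hit_rate <= lo:
        return "bypass"    # streaming access: always bypass
    return "dynamic"       # let the runtime decide per thread block

def should_cache(decision, block_id, cache_ratio):
    """Per-access choice; whole thread blocks cache or bypass together."""
    if decision == "cache":
        return True
    if decision == "bypass":
        return False
    return (block_id % 100) < cache_ratio * 100   # fraction of blocks cache

def adjust_ratio(cache_ratio, measured_hit_rate, target=0.5, step=0.1):
    """Run-time feedback (invented rule): grow the caching fraction while
    the cache is paying off, shrink it when contention wastes it."""
    if measured_hit_rate > target:
        return min(1.0, cache_ratio + step)
    return max(0.0, cache_ratio - step)
```
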
Increasing multicore system efficiency through intelligent bandwidth shifting
Víctor Jiménez, A. Buyuktosunoglu, P. Bose, F. O'Connell, F. Cazorla, M. Valero
{"title":"Increasing multicore system efficiency through intelligent bandwidth shifting","authors":"Víctor Jiménez, A. Buyuktosunoglu, P. Bose, F. O'Connell, F. Cazorla, M. Valero","doi":"10.1109/HPCA.2015.7056020","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056020","url":null,"abstract":"Memory bandwidth is a crucial resource in computing systems. Current CMP/SMT processors have a significant number of cores and they can run many threads concurrently. This large thread count adds high pressure to the memory bus, which demands high bandwidth to service memory requests from the cores. Hardware data prefetching is a well-known technique for hiding memory latency. Due to its speculative nature, however, in some situations prefetching does not effectively work, wasting memory bandwidth and polluting the caches. Data prefetching efficiency depends on the prefetching algorithm. It also depends on the characteristics of the applications running on the system. In this paper we propose an online bandwidth shifting mechanism that dynamically assigns bandwidth to applications according to their prefetch efficiency. This mechanism maximizes the utilization of memory bandwidth, thereby improving system performance and/or reducing memory power consumption. To the best of our knowledge, this solution is the first to not require hardware support. We evaluate the benefits of using our bandwidth shifting mechanism on a real system - the IBM POWER7. We obtain speedups in the order of 10-20% (in one instance, speedup exceeds 1.6X). Our mechanism does not generate a significant degree of unfairness among the applications. In many cases individual thread performance increases by 10-35%, while virtually no thread experiences a slowdown larger than 5%.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"55 1","pages":"39-50"},"PeriodicalIF":0.0,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81337515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
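
Since the mechanism is software-only, its core loop is easy to caricature: each sampling interval, compute every core's prefetch efficiency (useful prefetches over issued prefetches) and step its prefetcher aggressiveness up or down accordingly, shifting bandwidth toward cores that use it productively. The level table, thresholds, and counter names below are invented; on a real POWER7 the analogous knob would be the per-thread prefetch depth setting.

```python
AGGR_LEVELS = [0, 1, 2, 4, 8]   # illustrative prefetch depths; 0 = off

def shift_bandwidth(cores, eff_lo=0.3, eff_hi=0.7):
    """One sampling interval of the shifting loop (thresholds invented)."""
    for core in cores:
        eff = core["useful"] / max(core["issued"], 1)   # prefetch efficiency
        if eff < eff_lo and core["level"] > 0:
            core["level"] -= 1   # wasteful prefetcher: reclaim its bandwidth
        elif eff > eff_hi and core["level"] < len(AGGR_LEVELS) - 1:
            core["level"] += 1   # efficient prefetcher: grant more bandwidth
        core["useful"] = core["issued"] = 0   # reset interval counters

cores = [{"level": 2, "useful": 90, "issued": 100},   # accurate prefetcher
         {"level": 2, "useful": 10, "issued": 100}]   # polluting prefetcher
shift_bandwidth(cores)
print([c["level"] for c in cores])   # [3, 1]: bandwidth shifts between them
```
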
iPatch: Intelligent fault patching to improve energy efficiency
David J. Palframan, N. Kim, Mikko H. Lipasti
{"title":"iPatch: Intelligent fault patching to improve energy efficiency","authors":"David J. Palframan, N. Kim, Mikko H. Lipasti","doi":"10.1109/HPCA.2015.7056052","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056052","url":null,"abstract":"Dynamic voltage and frequency scaling can provide substantial energy savings but is limited by SRAM since some cells will fail at very low voltages. Due to process variation effects, a small subset of SRAM cells will be more sensitive to voltage reduction, requiring increased margins and limiting energy savings. Since large arrays like caches are most vulnerable to cell failures, recent proposals suggest disabling failing portions of the cache to enable low voltage operation. Although such approaches save power, energy reduction is limited because reducing the effective cache size increases program runtimes. In this paper, we present iPatch, a solution to regain this lost performance and enable energy savings by exploiting the redundancy inherent in superscalar processors. By relying on existing microarchitectural structures and mechanisms to \"patch\" the faulty parts of caches, we enable further energy reduction with minimal overhead and complexity. Furthermore, because no critical paths or circuits are affected by our implementation, there is no impact on normal-voltage operation. For high cell failure rates, our results show significant energy savings with iPatch as well as an 18% reduction in energy-delay product compared to prior work.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"80 1","pages":"428-438"},"PeriodicalIF":0.0,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89639136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
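
The abstract's "patching" idea can be caricatured with a toy model: a fault map marks the words that fail at low voltage, and accesses touching those words are serviced from a small side store that stands in for the existing microarchitectural redundancy the paper reuses. Everything below (the fault map, the patch dictionary, word-granularity faults) is an invented illustration of the concept, not the paper's mechanism.

```python
class PatchedCache:
    """Words in `faulty` fail at low voltage; reads and writes touching
    them are diverted to a small patch store instead of disabling lines."""
    def __init__(self, faulty_words):
        self.data = {}                   # stands in for the SRAM array
        self.faulty = set(faulty_words)  # fault map from a boot-time test
        self.patch = {}                  # tiny side structure: the "patch"

    def write(self, addr, value):
        target = self.patch if addr in self.faulty else self.data
        target[addr] = value

    def read(self, addr):
        source = self.patch if addr in self.faulty else self.data
        return source.get(addr)

c = PatchedCache(faulty_words={0x40})
c.write(0x40, 123)      # lands in the patch store, not the faulty SRAM word
print(c.read(0x40))     # 123: effective capacity is preserved at low voltage
```
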
Event-based scheduling for energy-efficient QoS (eQoS) in mobile Web applications
Yuhao Zhu, Matthew Halpern, V. Reddi
{"title":"Event-based scheduling for energy-efficient QoS (eQoS) in mobile Web applications","authors":"Yuhao Zhu, Matthew Halpern, V. Reddi","doi":"10.1109/HPCA.2015.7056028","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056028","url":null,"abstract":"Mobile Web applications have become an integral part of our society. They pose a high demand for application quality of service (QoS). However, the energy-constrained nature of mobile devices makes optimizing for QoS difficult. Prior art on energy efficiency optimizations has only focused on the trade-off between raw performance and energy consumption, ignoring the application QoS characteristics. In this paper, we propose the concept of energy-efficient QoS (eQoS) to capture the trade-off between QoS and energy consumption. Given the fundamental event-driven nature of mobile Web applications, we further propose event-based scheduling as an optimization framework for eQoS. The event-based scheduling automatically reasons about users' QoS requirements, and accurately slacks the events' execution time to save energy without violating end users' experience. We demonstrate a working prototype using the Google Chromium and V8 framework on the Samsung Exynos 5410 SoC (used in the Galaxy S4 smartphone). Based on real hardware and software measurements, we achieve 41.2% energy saving with only 0.4% of QoS violations perceptible to end users.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"30 1","pages":"137-149"},"PeriodicalIF":0.0,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89485830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 98
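
The scheduler's core decision reduces to: given a predicted amount of work for an event handler, run at the lowest frequency whose predicted latency still stays below the user-perceptible bound. A minimal sketch, assuming an invented 100 ms perceptibility threshold, a three-state frequency table, and a cycle-count work predictor:

```python
PERCEPTIBLE_MS = 100          # invented QoS bound for an interactive event
FREQS_GHZ = [0.6, 1.0, 1.6]   # invented DVFS states, slowest first

def pick_frequency(predicted_cycles):
    """Lowest frequency whose predicted handler latency meets the bound."""
    for f in FREQS_GHZ:
        latency_ms = predicted_cycles / (f * 1e6)   # GHz -> cycles per ms
        if latency_ms <= PERCEPTIBLE_MS:
            return f          # enough slack: run slowly and save energy
    return FREQS_GHZ[-1]      # no slack: run flat out, QoS comes first

print(pick_frequency(50e6))    # 0.6 GHz suffices (83 ms < 100 ms)
print(pick_frequency(140e6))   # needs 1.6 GHz (87.5 ms at full speed)
```
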
Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
Devesh Tiwari, Saurabh Gupta, James H. Rogers, Don E. Maxwell, P. Rech, Sudharshan S. Vazhkudai, Daniel Oliveira, Dave Londo, Nathan Debardeleben, P. Navaux, L. Carro, Arthur S. Bland
{"title":"Understanding GPU errors on large-scale HPC systems and the implications for system design and operation","authors":"Devesh Tiwari, Saurabh Gupta, James H. Rogers, Don E. Maxwell, P. Rech, Sudharshan S. Vazhkudai, Daniel Oliveira, Dave Londo, Nathan Debardeleben, P. Navaux, L. Carro, Arthur S. Bland","doi":"10.1109/HPCA.2015.7056044","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056044","url":null,"abstract":"Increase in graphics hardware performance and improvements in programmability has enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose computing device. Titan, the world's second fastest supercomputer for open science in 2014, consists of more dum 18,000 GPUs that scientists from various domains such as astrophysics, fusion, climate, and combustion use routinely to run large-scale simulations. Unfortunately, while the performance efficiency of GPUs is well understood, their resilience characteristics in a large-scale computing system have not been fully evaluated. We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system. Our data was collected from the Titan supercomputer at the Oak Ridge Leadership Computing Facility and a GPU cluster at the Los Alamos National Laboratory. We also present results from our extensive neutron-beam tests, conducted at Los Alamos Neutron Science Center (LANSCE) and at ISIS (Rutherford Appleron Laboratories, UK), to measure the resilience of different generations of GPUs. We present several findings from our field data and neutron-beam experiments, and discuss the implications of our results for future GPU architects, current and future HPC computing facilities, and researchers focusing on GPU resilience.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"58 1","pages":"331-342"},"PeriodicalIF":0.0,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91082720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 138
Correction prediction: Reducing error correction latency for on-chip memories
Henry Duwe, Xun Jian, Rakesh Kumar
{"title":"Correction prediction: Reducing error correction latency for on-chip memories","authors":"Henry Duwe, Xun Jian, Rakesh Kumar","doi":"10.1109/HPCA.2015.7056055","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056055","url":null,"abstract":"The reliability of on-chip memories (e.g., caches) determines their minimum operating voltage (Vmin) and, therefore, the power these memories consume. A strong error correction mechanism can be used to tolerate the increasing memory cell failure rate as supply voltage is reduced. However, strong error correction often incurs a high latency relative to the on-chip memory access time. We propose correction prediction where a fast mechanism predicts the result of strong error correction to hide the long latency of correction. Subsequent pipeline stages execute using the predicted values while the long latency strong error correction attempts to verify the correctness of the predicted values in parallel. We present a simple correction prediction implementation, CP, which uses a fast, but weak error correction mechanism as the correction predictor. Our evaluations for a 32KB 4-way set associative SRAM L1 cache show that the proposed implementation, CP, reduces the average cache access latency by 38%-52% compared to using a strong error correction scheme alone. This reduces the energy of a 2-issue in-order core by 16%-21%.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"38 1","pages":"463-475"},"PeriodicalIF":0.0,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88189372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
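
The prediction/verification split can be expressed as a small control-flow sketch: the weak corrector's output is forwarded immediately, and the strong corrector runs in parallel, triggering a squash-and-replay only when the two disagree. The function interfaces below (weak_correct, strong_correct, replay) are invented stand-ins for the pipeline mechanisms.

```python
def read_with_prediction(raw_word, weak_correct, strong_correct, replay):
    """Forward the weak corrector's guess at once; verify it in parallel."""
    predicted = weak_correct(raw_word)    # fast path: consumers use this now
    def verify():                         # models the slow strong corrector
        actual = strong_correct(raw_word)
        if actual != predicted:
            replay(actual)                # misprediction: squash and reissue
    return predicted, verify

value, verify = read_with_prediction(
    0b1011,
    weak_correct=lambda w: w,             # toy stand-ins: identity "ECC"
    strong_correct=lambda w: w,
    replay=lambda v: print("replay with", v))
verify()   # correctors agree here, so no replay is triggered
```
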
Mascar: Speeding up GPU warps by reducing memory pitstops
Ankit Sethia, D. Jamshidi, S. Mahlke
{"title":"Mascar: Speeding up GPU warps by reducing memory pitstops","authors":"Ankit Sethia, D. Jamshidi, S. Mahlke","doi":"10.1109/HPCA.2015.7056031","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056031","url":null,"abstract":"With the prevalence of GPUs as throughput engines for data parallel workloads, the landscape of GPU computing is changing significantly. Non-graphics workloads with high memory intensity and irregular access patterns are frequently targeted for acceleration on GPUs. While GPUs provide large numbers of compute resources, the resources needed for memory intensive workloads are more scarce. Therefore, managing access to these limited memory resources is a challenge for GPUs. We propose a novel Memory Aware Scheduling and Cache Access Re-execution (Mascar) system on GPUs tailored for better performance for memory intensive workloads. This scheme detects memory saturation and prioritizes memory requests among warps to enable better overlapping of compute and memory accesses. Furthermore, it enables limited re-execution of memory instructions to eliminate structural hazards in the memory subsystem and take advantage of cache locality in cases where requests cannot be sent to the memory due to memory saturation. Our results show that Mascar provides a 34% speedup over the baseline round-robin scheduler and 10% speedup over the state of the art warp schedulers for memory intensive workloads. Mascar also achieves an average of 12% savings in energy for such workloads.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"16 1","pages":"174-185"},"PeriodicalIF":0.0,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76809705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 78
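
Mascar's scheduling half can be sketched as a priority rule: when the memory subsystem saturates, only one "owner" warp may issue memory requests while compute-ready warps still issue freely; once saturation clears, normal scheduling resumes. The saturation test (90% MSHR occupancy) and the warp fields below are invented for illustration, not taken from the paper.

```python
def pick_warp(warps, mshr_used, mshr_size, owner):
    """Choose the next warp to issue; returns (warp, owner)."""
    saturated = mshr_used >= 0.9 * mshr_size      # memory subsystem backed up
    if not saturated:
        owner = None                              # normal mode: anyone issues
        candidates = [w for w in warps if w["ready"]]
    else:
        if owner is None or owner["done"]:
            # Elect one memory warp to monopolize memory issue so its
            # requests complete instead of everyone stalling together.
            owner = next((w for w in warps if w["ready"] and w["mem_op"]),
                         None)
        candidates = [w for w in warps
                      if w["ready"] and (not w["mem_op"] or w is owner)]
    # Issue the first candidate (a real scheduler would round-robin).
    return (candidates[0] if candidates else None), owner

warps = [{"ready": True, "mem_op": True, "done": False},
         {"ready": True, "mem_op": False, "done": False}]
picked, owner = pick_warp(warps, mshr_used=31, mshr_size=32, owner=None)
# warp 0 becomes the owner; other memory warps would be held back while
# compute warps like warp 1 keep overlapping with the outstanding traffic
```
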