{"title":"Delay-Hiding energy management mechanisms for DRAM","authors":"Mingsong Bi, Ran Duan, C. Gniady","doi":"10.1109/HPCA.2010.5416646","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416646","url":null,"abstract":"Current trends in data-intensive applications increase the demand for larger physical memory, resulting in the memory subsystem consuming a significant portion of system's energy. Furthermore, data-intensive applications heavily rely on a large buffer cache that occupies a majority of physical memory. Subsequently, we are focusing on the power management for physical memory dedicated to the buffer cache. Several techniques have been proposed to reduce energy consumption by transitioning DRAM into low-power states. However, transitions between different power states incur delays and may affect whole system performance. We take advantage of the I/O handling routines in the OS kernel to hide the delay incurred by the memory state transition so that performance degradation is minimized while maintaining high memory energy savings. Our evaluation shows that the best of the proposed mechanisms hides almost all transition latencies while only consuming 3% more energy as compared to the existing on-demand mechanism, which can expose significant delays.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116185904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Libo Huang, Li Shen, Zhiying Wang, Wei Shi, Nong Xiao, Sheng Ma
{"title":"SIF: Overcoming the limitations of SIMD devices via implicit permutation","authors":"Libo Huang, Li Shen, Zhiying Wang, Wei Shi, Nong Xiao, Sheng Ma","doi":"10.1109/HPCA.2010.5416631","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416631","url":null,"abstract":"SIMD devices have gained widespread acceptance in modern microprocessor designs for their superior performance for multimedia applications. However, there are three remaining limitations to the efficient utilization of SIMD devices in general-purpose computer systems: memory alignment, data reorganization and control flow. This paper presents SIF, an efficient SIMD interface framework that addresses these three shortcomings without modifying existing ISA. It is designed around a permutation vector register file (PVRF) and it adds new extended instructions to set internal permutation state in SIMD datapath rather than putting the permutation state setting bits in every instruction. The implicit permutation capability provided by PVRF results in zero overhead, which frees the handling of three limitations by using permutation instructions. To further reduce the state setting instructions in SIMD datapath, a technique that moves the workloads from SIMD pipeline into scalar pipeline is also introduced. With the help of proposed compilation algorithm, SIF can efficiently transform regular SIMD codes into SIF codes which make it easily integrated in all existing SIMD devices. We implemented these techniques in a vectorizing compiler and experimental results show that most of the permutation overhead instructions can be eliminated and distinct performance speedup can be achieved, which is 37% higher than current SIMD techniques on average.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125996543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dimitris Kaseridis, Jeffrey Stuecheli, Jing Chen, L. John
{"title":"A bandwidth-aware memory-subsystem resource management using non-invasive resource profilers for large CMP systems","authors":"Dimitris Kaseridis, Jeffrey Stuecheli, Jing Chen, L. John","doi":"10.1109/HPCA.2010.5416654","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416654","url":null,"abstract":"By integrating multiple cores in a single chip, Chip Multiprocessors (CMP) provide an attractive approach to improve both system throughput and efficiency. This integration allows the sharing of on-chip resources which may lead to destructive interference between the executing workloads. Memorysubsystem is an important shared resource that contributes significantly to the overall throughput and power consumption. In order to prevent destructive interference, the cache capacity and memory bandwidth requirements of the last level cache have to be controlled. While previously proposed schemes focus on resource sharing within a chip, we explore additional possibilities both inside and outside a single chip. We propose a dynamic memory-subsystem resource management scheme that considers both cache capacity and memory bandwidth contention in large multi-chip CMP systems. Our approach uses low overhead, non-invasive resource profilers that are based on Mattson's stack distance algorithm to project each core's resource requirements and guide our cache partitioning algorithms. Our bandwidth-aware algorithm seeks for throughput optimizations among multiple chips by migrating workloads from the most resource-overcommitted chips to the ones with more available resources. Use of bandwidth as a criterion results in an overall 18% reduction in memory bandwidth along with a 7.9% reduction in miss rate, compared to existing resource management schemes. Using a cycle-accurate full system simulator, our approach achieved an average improvement of 8.5% on throughput.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115724226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interval simulation: Raising the level of abstraction in architectural simulation","authors":"Davy Genbrugge, Stijn Eyerman, L. Eeckhout","doi":"10.1109/HPCA.2010.5416636","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416636","url":null,"abstract":"Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner. This paper proposes interval simulation which takes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor. By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the M5 multi-core simulator, show good accuracy up to eight cores (average error of 4.6% and max error of 11% for the multi-threaded full-system workloads), while achieving a one order of magnitude simulation speedup compared to cycle-accurate simulation. Moreover, interval simulation is easy to implement: our implementation of the mechanistic analytical model incurs only one thousand lines of code. Its high accuracy, fast simulation speed and ease-of-use make interval simulation a useful complement to the architect's toolbox for exploring system-level and high-level micro-architecture trade-offs.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116213363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Malcolm S. Allen-Ware, K. Rajamani, M. Floyd, B. Brock, J. Rubio, F. Rawson, J. Carter
{"title":"Architecting for power management: The IBM® POWER7™ approach","authors":"Malcolm S. Allen-Ware, K. Rajamani, M. Floyd, B. Brock, J. Rubio, F. Rawson, J. Carter","doi":"10.1109/HPCA.2010.5416627","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416627","url":null,"abstract":"The POWER7 processor is the newest member of the IBM POWER® family of server processors. With greater than 4X the peak performance and the same power budget as the previous generation POWER6®, POWER7 will deliver impressive energy-efficiency boosts. The improved peak energy-efficiency is accompanied by a wide array of new features in the processor and system designs that advance IBM's EnergyScale™ dynamic power management methodology. This paper provides an overview of these new features, which include better sensing, more advanced power controls, improved scalability for power management, and features to address the diverse needs of the full range of POWER servers from blades to supercomputers. We also highlight three challenges that need attention from a range of systems design and research teams: (i) power management in highly virtualized environments, (ii) power (in)efficiency of systems software and applications, and (iii) memory power costs, especially for servers with large memory footprints.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":" 31","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132074539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable architectural support for trusted software","authors":"D. Champagne, R. Lee","doi":"10.1109/HPCA.2010.5416657","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416657","url":null,"abstract":"We present Bastion, a new hardware-software architecture for protecting security-critical software modules in an untrusted software stack. Our architecture is composed of enhanced microprocessor hardware and enhanced hypervisor software. Each trusted software module is provided with a secure, fine-grained memory compartment and its own secure persistent storage area. Bastion is the first architecture to provide direct hardware protection of the hypervisor from both software and physical attacks, before employing the hypervisor to provide the same protection to security-critical OS and application modules. Our implementation demonstrates the feasibility of bypassing an untrusted commodity OS to provide application security and shows better security with higher performance when compared to the Trusted Platform Module (TPM), the current industry state-of-the-art security chip. We provide a proof-of-concept implementation on the OpenSPARC platform.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114257414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth","authors":"Dong Hyuk Woo, N. Seong, D. L. Lewis, H. Lee","doi":"10.1109/HPCA.2010.5416628","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416628","url":null,"abstract":"Memory bandwidth has become a major performance bottleneck as more and more cores are integrated onto a single die, demanding more and more data from the system memory. Several prior studies have demonstrated that this memory bandwidth problem can be addressed by employing a 3D-stacked memory architecture, which provides a wide, high frequency memory-bus interface. Although previous 3D proposals already provide as much bandwidth as a traditional L2 cache can consume, the dense through-silicon-vias (TSVs) of 3D chip stacks can provide still more bandwidth. In this paper, we contest that we need to re-architect our memory hierarchy, including the L2 cache and DRAM interface, so that it can take full advantage of this massive bandwidth. Our technique, SMART-3D, is a new 3D-stacked memory architecture with a vertical L2 fetch/write-back network using a large array of TSVs. Simply stated, we leverage the TSV bandwidth to hide latency behind very large data transfers. We analyze the design trade-offs for the DRAM arrays, careful enough to avoid compromising the DRAM density because of TSV placement. Moreover, we propose an efficient mechanism to manage the false sharing problem when implementing SMART-3D in a multi-socket system. For single-threaded memory-intensive applications, the SMART-3D architecture achieves speedups from 1.53 to 2.14 over planar designs and from 1.27 to 1.72 over prior 3D designs. We achieve similar speedups for multi-program and multi-threaded workloads on multi-core and multi-socket processors. Furthermore, SMART-3D can even lower the energy consumption in the L2 cache and 3D DRAM for it reduces the total number of row buffer misses.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120965657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simple virtual channel allocation for high throughput and high frequency on-chip routers","authors":"Yi Xu, Bo Zhao, Youtao Zhang, Jun Yang","doi":"10.1145/2742349","DOIUrl":"https://doi.org/10.1145/2742349","url":null,"abstract":"Technology scaling has led to the integration of many cores into a single chip. As a result, on-chip interconnection networks start to play a more and more important role in determining the performance and power of the entire chip. Packet-switched network-on-chip (NoC) has provided a scalable solution to the communications for tiled multi-core processors. However the virtual-channel (VC) buffers in the NoC consume significant dynamic and leakage power of the system. To improve the energy efficiency of the router design, it is advantageous to use small buffer sizes while still maintaining throughput of the network. This paper proposes two new virtual channel allocation (VA) mechanisms, termed Fixed VC Assignment with Dynamic VC Allocation (FVADA) and Adjustable VC Assignment with Dynamic VC Allocation (AVADA). The idea is that VCs are assigned based on the designated output port of a packet to reduce the Head-of-Line (HoL) blocking. Also, the number of VCs allocated for each output port can be adjusted dynamically. Unlike previous buffer-pool based designs, we only use a small number of VCs to keep the arbitration latency low. Simulation results show that FVADA and AVADA can improve the network throughput by 41% on average, compared to a baseline design with the same buffer size. AVADA can still outperform the baseline even when our buffer size is halved. Moreover, we are able to achieve comparable or better throughput than a previous dynamic VC allocator while reducing its critical path delay by 60%. Our results prove that the proposed VA mechanisms are suitable for low-power, high-throughput, and high-frequency on-chip network designs.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122007530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding how off-chip memory bandwidth partitioning in Chip Multiprocessors affects system performance","authors":"Fang Liu, Xiaowei Jiang, Yan Solihin","doi":"10.1109/HPCA.2010.5416655","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416655","url":null,"abstract":"Chip Multi-Processor (CMP) architectures have recently become a mainstream computing platform. Recent CMPs allow cores to share expensive resources, such as the last level cache and off-chip pin bandwidth. To improve system performance and reduce the performance volatility of individual threads, last level cache and off-chip bandwidth partitioning schemes have been proposed. While how cache partitioning affects system performance is well understood, little is understood regarding how bandwidth partitioning affects system performance, and how bandwidth and cache partitioning interact with one another. In this paper, we propose a simple yet powerful analytical model that gives us an ability to answer several important questions: (1) How does off-chip bandwidth partitioning improve system performance? (2) In what situations the performance improvement is high or low, and what factors determine that? (3) In what way cache and bandwidth partitioning interact, and is the interaction negative or positive? (4) Can a theoretically optimum bandwidth partition be derived, and if so, what factors affect it? We believe understanding the answers to these questions is very valuable to CMP system designers in coming up with strategies to deal with the scarcity of off-chip bandwidth in future CMPs with many cores on a chip.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"15 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133651110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution","authors":"Andrew D. Hilton, A. Roth","doi":"10.1109/HPCA.2010.5416634","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416634","url":null,"abstract":"LT (latency tolerant) execution is an attractive candidate technique for future out-of-order cores. LT defers the forward slices of LLC (last-level cache) misses to a slice buffer and re-executes them when the misses return. An LT core increases ILP without physically scaling the issue queue and register file and increases MLP without additional software threads that can reduce cache performance. Unfortunately, proposed LT designs are not energy efficient. They require too many additional structures and they defer and re-execute too many instructions to justify their performance gains. In this paper, we address these inefficiencies. We introduce a microarchitecture called BOLT (Better Out-of-Order Latency-Tolerance) that implements LT as an alternative use of SMT (Simultaneous Multi-Threading). We also present a new slice buffer organization and traversal scheme that increases performance and reduces overhead by pruning instances of useless and redundant LT. Collectively, these modifications turn out-of-order LT into a technique that improves performance in an energy-efficient way.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"966 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116442908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}