HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture: Latest Papers

Delay-Hiding energy management mechanisms for DRAM

HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture Pub Date : 2010-04-01 DOI: 10.1109/HPCA.2010.5416646

Mingsong Bi, Ran Duan, C. Gniady

Citations: 24

SIF: Overcoming the limitations of SIMD devices via implicit permutation

HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture Pub Date : 2010-04-01 DOI: 10.1109/HPCA.2010.5416631

Libo Huang, Li Shen, Zhiying Wang, Wei Shi, Nong Xiao, Sheng Ma

Abstract: SIMD devices have gained widespread acceptance in modern microprocessor designs for their superior performance on multimedia applications. However, three limitations still hinder the efficient use of SIMD devices in general-purpose computer systems: memory alignment, data reorganization, and control flow. This paper presents SIF, an efficient SIMD interface framework that addresses these three shortcomings without modifying the existing ISA. It is designed around a permutation vector register file (PVRF) and adds new extended instructions that set internal permutation state in the SIMD datapath, rather than putting permutation state-setting bits in every instruction. The implicit permutation capability provided by the PVRF incurs zero overhead, so the three limitations no longer have to be handled with explicit permutation instructions. To further reduce the state-setting instructions in the SIMD datapath, a technique that moves workloads from the SIMD pipeline into the scalar pipeline is also introduced. With the help of the proposed compilation algorithm, SIF can efficiently transform regular SIMD code into SIF code, which makes it easy to integrate into existing SIMD devices. We implemented these techniques in a vectorizing compiler, and experimental results show that most of the permutation overhead instructions can be eliminated and that performance is 37% higher on average than with current SIMD techniques.

Citations: 15

A bandwidth-aware memory-subsystem resource management using non-invasive resource profilers for large CMP systems

HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture Pub Date : 2010-04-01 DOI: 10.1109/HPCA.2010.5416654

Dimitris Kaseridis, Jeffrey Stuecheli, Jing Chen, L. John

Abstract: By integrating multiple cores in a single chip, Chip Multiprocessors (CMP) provide an attractive approach to improve both system throughput and efficiency. This integration allows the sharing of on-chip resources, which may lead to destructive interference between the executing workloads. The memory subsystem is an important shared resource that contributes significantly to the overall throughput and power consumption. In order to prevent destructive interference, the cache capacity and memory bandwidth requirements of the last-level cache have to be controlled. While previously proposed schemes focus on resource sharing within a chip, we explore additional possibilities both inside and outside a single chip. We propose a dynamic memory-subsystem resource management scheme that considers both cache capacity and memory bandwidth contention in large multi-chip CMP systems. Our approach uses low-overhead, non-invasive resource profilers that are based on Mattson's stack distance algorithm to project each core's resource requirements and guide our cache partitioning algorithms. Our bandwidth-aware algorithm seeks throughput optimizations among multiple chips by migrating workloads from the most resource-overcommitted chips to the ones with more available resources. Use of bandwidth as a criterion results in an overall 18% reduction in memory bandwidth along with a 7.9% reduction in miss rate, compared to existing resource management schemes. Using a cycle-accurate full-system simulator, our approach achieved an average throughput improvement of 8.5%.
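The abstract above mentions profilers based on Mattson's stack distance algorithm. As a rough illustration only (this is not the authors' hardware profiler), the following Python sketch computes LRU stack distances for a trace of block addresses and derives the miss count of a fully associative LRU cache of a given capacity. The function names and the example trace are assumptions made for this sketch.

```python
def stack_distances(trace):
    """Return the LRU stack distance of each access (None for a first touch)."""
    stack = []       # least recently used block first, most recent last
    distances = []
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            # Depth of the block in the LRU stack, counted from the top (1 = most recent).
            distances.append(len(stack) - pos)
            stack.pop(pos)
        else:
            distances.append(None)  # cold miss: block was never seen before
        stack.append(block)         # block becomes the most recently used
    return distances


def misses_for_capacity(distances, capacity):
    """Misses of a fully associative LRU cache that holds `capacity` blocks."""
    return sum(1 for d in distances if d is None or d > capacity)


if __name__ == "__main__":
    trace = ["a", "b", "c", "a", "b", "d", "a"]  # hypothetical block trace
    dist = stack_distances(trace)
    for capacity in (2, 3, 4):
        print(capacity, misses_for_capacity(dist, capacity))
```

One pass over the trace yields the miss count for every capacity at once, which is what makes stack distance profiles attractive as input to a cache partitioning decision.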

Citations: 39

Interval simulation: Raising the level of abstraction in architectural simulation

HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture Pub Date : 2010-04-01 DOI: 10.1109/HPCA.2010.5416636

Davy Genbrugge, Stijn Eyerman, L. Eeckhout

Abstract: Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner. This paper proposes interval simulation, which takes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor. By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the M5 multi-core simulator show good accuracy up to eight cores (average error of 4.6% and max error of 11% for the multi-threaded full-system workloads), while achieving a simulation speedup of one order of magnitude compared to cycle-accurate simulation. Moreover, interval simulation is easy to implement: our implementation of the mechanistic analytical model takes only one thousand lines of code. Its high accuracy, fast simulation speed and ease of use make interval simulation a useful complement to the architect's toolbox for exploring system-level and high-level micro-architecture trade-offs.
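To make the notion of an interval concrete, here is a deliberately simplified Python sketch of an interval-style estimate. It assumes, as a simplification not taken from the abstract, that the core dispatches at full width between miss events and that each miss event adds a fixed penalty. The penalty values, event names, and function name are hypothetical; the paper's mechanistic model is more detailed.

```python
import math


def estimate_cycles(intervals, dispatch_width, penalties):
    """Estimate core cycles from (instruction_count, miss_event) intervals.

    Each interval is a run of instructions ended by a miss event
    (or None when the trace ends without one).
    """
    total = 0
    for n_instr, event in intervals:
        # Assumed steady state: dispatch_width instructions per cycle.
        total += math.ceil(n_instr / dispatch_width)
        # Assumed fixed penalty per miss event type; no event costs nothing.
        total += penalties.get(event, 0)
    return total


if __name__ == "__main__":
    # Hypothetical penalties in cycles, chosen only for illustration.
    penalties = {"branch_mispredict": 15, "llc_miss": 200}
    intervals = [(40, "branch_mispredict"), (100, "llc_miss"), (10, None)]
    print(estimate_cycles(intervals, dispatch_width=4, penalties=penalties))
```

In a setup like the one the abstract describes, the miss events themselves would come from simulating the memory hierarchy and branch predictor; only the core timing is replaced by the analytical estimate.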

Citations: 128

Architecting for power management: The IBM® POWER7™ approach

HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture Pub Date : 2010-04-01 DOI: 10.1109/HPCA.2010.5416627

Malcolm S. Allen-Ware, K. Rajamani, M. Floyd, B. Brock, J. Rubio, F. Rawson, J. Carter

Citations: 162

Scalable architectural support for trusted software

HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture Pub Date : 2010-04-01 DOI: 10.1109/HPCA.2010.5416657

D. Champagne, R. Lee

Citations: 183

An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth

HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture Pub Date : 2010-04-01 DOI: 10.1109/HPCA.2010.5416628

Dong Hyuk Woo, N. Seong, D. L. Lewis, H. Lee

Abstract: Memory bandwidth has become a major performance bottleneck as more and more cores are integrated onto a single die, demanding more and more data from the system memory. Several prior studies have demonstrated that this memory bandwidth problem can be addressed by employing a 3D-stacked memory architecture, which provides a wide, high-frequency memory-bus interface. Although previous 3D proposals already provide as much bandwidth as a traditional L2 cache can consume, the dense through-silicon vias (TSVs) of 3D chip stacks can provide still more bandwidth. In this paper, we contend that we need to re-architect our memory hierarchy, including the L2 cache and DRAM interface, so that it can take full advantage of this massive bandwidth. Our technique, SMART-3D, is a new 3D-stacked memory architecture with a vertical L2 fetch/write-back network using a large array of TSVs. Simply stated, we leverage the TSV bandwidth to hide latency behind very large data transfers. We analyze the design trade-offs for the DRAM arrays, taking care not to compromise the DRAM density because of TSV placement. Moreover, we propose an efficient mechanism to manage the false sharing problem when implementing SMART-3D in a multi-socket system. For single-threaded memory-intensive applications, the SMART-3D architecture achieves speedups from 1.53 to 2.14 over planar designs and from 1.27 to 1.72 over prior 3D designs. We achieve similar speedups for multi-program and multi-threaded workloads on multi-core and multi-socket processors. Furthermore, SMART-3D can even lower the energy consumption in the L2 cache and 3D DRAM because it reduces the total number of row buffer misses.

Citations: 255

Simple virtual channel allocation for high throughput and high frequency on-chip routers

HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture Pub Date : 2010-04-01 DOI: 10.1145/2742349

Yi Xu, Bo Zhao, Youtao Zhang, Jun Yang

Abstract: Technology scaling has led to the integration of many cores into a single chip. As a result, on-chip interconnection networks start to play a more and more important role in determining the performance and power of the entire chip. Packet-switched network-on-chip (NoC) has provided a scalable solution to the communications for tiled multi-core processors. However, the virtual-channel (VC) buffers in the NoC consume significant dynamic and leakage power of the system. To improve the energy efficiency of the router design, it is advantageous to use small buffer sizes while still maintaining throughput of the network. This paper proposes two new virtual channel allocation (VA) mechanisms, termed Fixed VC Assignment with Dynamic VC Allocation (FVADA) and Adjustable VC Assignment with Dynamic VC Allocation (AVADA). The idea is that VCs are assigned based on the designated output port of a packet to reduce Head-of-Line (HoL) blocking. Also, the number of VCs allocated for each output port can be adjusted dynamically. Unlike previous buffer-pool based designs, we only use a small number of VCs to keep the arbitration latency low. Simulation results show that FVADA and AVADA can improve the network throughput by 41% on average, compared to a baseline design with the same buffer size. AVADA can still outperform the baseline even when our buffer size is halved. Moreover, we are able to achieve comparable or better throughput than a previous dynamic VC allocator while reducing its critical path delay by 60%. Our results prove that the proposed VA mechanisms are suitable for low-power, high-throughput, and high-frequency on-chip network designs.
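The core idea stated above, assigning VCs by a packet's designated output port, can be illustrated with a toy Python sketch. The modulo mapping, the function names, and the packet tuples below are assumptions made purely for illustration; the abstract does not specify the actual FVADA or AVADA assignment rules, and this sketch omits dynamic adjustment entirely.

```python
def fixed_vc_for_output(output_port, num_vcs):
    """Hypothetical fixed mapping from an output port index to a VC index."""
    return output_port % num_vcs


def assign_packets(packets, num_vcs):
    """Place (packet_id, output_port) pairs into per-VC queues at one input port."""
    queues = [[] for _ in range(num_vcs)]
    for packet_id, output_port in packets:
        vc = fixed_vc_for_output(output_port, num_vcs)
        queues[vc].append(packet_id)
    return queues


if __name__ == "__main__":
    # Hypothetical packets arriving at one input port: (id, designated output port).
    packets = [("p0", 0), ("p1", 1), ("p2", 0), ("p3", 3)]
    print(assign_packets(packets, num_vcs=4))
```

With one VC per output port, a packet waiting on a congested output only queues behind packets headed to that same output, which is the intuition behind reducing Head-of-Line blocking.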

Citations: 51

Understanding how off-chip memory bandwidth partitioning in Chip Multiprocessors affects system performance

HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture Pub Date : 2010-04-01 DOI: 10.1109/HPCA.2010.5416655

Fang Liu, Xiaowei Jiang, Yan Solihin

Abstract: Chip Multi-Processor (CMP) architectures have recently become a mainstream computing platform. Recent CMPs allow cores to share expensive resources, such as the last-level cache and off-chip pin bandwidth. To improve system performance and reduce the performance volatility of individual threads, last-level cache and off-chip bandwidth partitioning schemes have been proposed. While how cache partitioning affects system performance is well understood, little is understood regarding how bandwidth partitioning affects system performance, and how bandwidth and cache partitioning interact with one another. In this paper, we propose a simple yet powerful analytical model that allows us to answer several important questions: (1) How does off-chip bandwidth partitioning improve system performance? (2) In what situations is the performance improvement high or low, and what factors determine that? (3) In what way do cache and bandwidth partitioning interact, and is the interaction negative or positive? (4) Can a theoretically optimum bandwidth partition be derived, and if so, what factors affect it? We believe understanding the answers to these questions is very valuable to CMP system designers in coming up with strategies to deal with the scarcity of off-chip bandwidth in future CMPs with many cores on a chip.

Citations: 90

BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution

HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture Pub Date : 2010-04-01 DOI: 10.1109/HPCA.2010.5416634

Andrew D. Hilton, A. Roth

Citations: 28