{"title":"Mercury: A fast and energy-efficient multi-level cell based Phase Change Memory system","authors":"Madhura Joshi, Wangyuan Zhang, Tao Li","doi":"10.1109/HPCA.2011.5749742","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749742","url":null,"abstract":"Phase Change Memory (PCM) is one of the most promising technologies among emerging non-volatile memories. PCM stores data in the crystalline and amorphous phases of the GST material, exploiting the large difference in their electrical resistivity. Although it is possible to design a high-capacity memory system by storing multiple bits at intermediate levels between the highest and lowest resistance states of PCM, it is difficult to obtain the tight resistance distributions required for accurate reading of the data. Moreover, the programming latency and energy required by a multi-level cell PCM (MLC-PCM) cell are non-trivial and can be a major hurdle to adopting multi-level PCM in a high-density memory architecture. Furthermore, process variation (PV) exacerbates the variability in the necessary programming current and hence the target resistance spread, demanding high-latency, multi-iteration program-and-verify write schemes for MLC-PCM. PV-aware control of the programming current, programming with staircase-down current pulses, and programming with increasing reset current pulses are traditional techniques for optimizing programming energy, write latency, and accuracy, but each usually targets only one aspect of the design. In this paper, we address the high write latency and process variation issues of MLC-PCM by introducing Mercury: a fast and energy-efficient multi-level cell based phase change memory architecture. Mercury adapts the programming scheme of a multi-level PCM cell to the initial state of the cell, the target resistance to be programmed, and the effect of process variation on the cell's programming current profile. The proposed techniques act at both the circuit and microarchitecture levels. Simulation results show that Mercury achieves a 10% saving in programming latency and a 25% saving in programming energy for the PCM memory system compared to traditional methods.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134645849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A quantitative performance analysis model for GPU architectures","authors":"Yao Zhang, John Douglas Owens","doi":"10.1109/HPCA.2011.5749745","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749745","url":null,"abstract":"We develop a microbenchmark-based performance model for NVIDIA GeForce 200-series GPUs. Our model identifies GPU program bottlenecks and quantitatively analyzes performance, and thus allows programmers and architects to predict the benefits of potential program optimizations and architectural improvements. In particular, we use a microbenchmark-based approach to develop a throughput model for three major components of GPU execution time: the instruction pipeline, shared memory access, and global memory access. Because our model is based on the GPU's native instruction set, we can predict performance with a 5–15% error. To demonstrate the usefulness of the model, we analyze three representative real-world and already highly optimized programs: dense matrix multiply, tridiagonal systems solver, and sparse matrix-vector multiply. The model provides detailed quantitative performance analysis, allowing us to understand the configuration of the fastest dense matrix multiply implementation and to speed up the tridiagonal solver and sparse matrix-vector multiply by 60% and 18%, respectively. Furthermore, applying our model to these codes allows us to suggest architectural improvements in hardware resource allocation, bank conflict avoidance, block scheduling, and memory transaction granularity.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115417646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Storage free confidence estimation for the TAGE branch predictor","authors":"André Seznec","doi":"10.1109/HPCA.2011.5749750","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749750","url":null,"abstract":"For the past 15 years, it has been shown that confidence estimation of branch predictions can serve various purposes, such as fetch gating or throttling for power saving, or controlling resource allocation policies in an SMT processor. In many proposals, dedicating extra hardware, and particularly storage tables, to branch confidence estimators has been considered a worthwhile silicon investment. The TAGE predictor, presented in 2006, is so far considered the state-of-the-art conditional branch predictor. In this paper, we show that very accurate confidence estimates for the predictions of the TAGE predictor can be obtained by simply observing the outputs of the predictor tables. Many confidence estimators proposed in the literature only discriminate between high-confidence and low-confidence predictions. It has recently been pointed out that a more selective confidence discrimination could be useful. We show that observing the outputs of the predictor tables is sufficient to grade the confidence of the branch predictions at a very good granularity. Moreover, a slight modification of the predictor automaton allows the predictions to be discriminated into three classes: low confidence (with a misprediction rate in the 30% range), medium confidence (with a misprediction rate in the 8–12% range), and high confidence (with a misprediction rate lower than 1%).","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"204 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121301545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing","authors":"Michael Pellauer, Michael Adler, M. Kinsy, A. Parashar, J. Emer","doi":"10.1109/HPCA.2011.5749747","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749747","url":null,"abstract":"In this paper we present the HAsim FPGA-accelerated simulator. HAsim is able to model a shared-memory multicore system including detailed core pipelines, cache hierarchy, and on-chip network, using a single FPGA. We describe the scaling techniques that make this possible, including novel uses of time-multiplexing in the core pipeline and on-chip network. We compare our time-multiplexed approach to a direct implementation, and present a case study that motivates why high-detail simulations should continue to play a role in the architectural exploration process.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130302222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Architectural framework for supporting operating system survivability","authors":"Xiaowei Jiang, Yan Solihin","doi":"10.1109/HPCA.2011.5749751","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749751","url":null,"abstract":"The ever-increasing size and complexity of Operating System (OS) kernel code bring an inevitable increase in the number of security vulnerabilities that can be exploited by attackers. A successful security attack on the kernel has a profound impact that may affect all processes running on it. In this paper we propose an architectural framework that provides survivability to the OS kernel, i.e., the ability to maintain normal system operation despite security faults. It consists of three components that work together: (1) security attack detection, (2) security fault isolation, and (3) a recovery mechanism that resumes normal system operation. Through simple but carefully designed architecture support, we provide OS kernel survivability with low performance overheads (< 5% for kernel-intensive benchmarks). When tested with real-world security attacks, our survivability mechanism automatically prevents the security faults from corrupting the kernel state or affecting other processes, recovers the kernel state, and resumes execution.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128786949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keynote address I: Programming the cloud","authors":"J. Larus","doi":"10.1109/HPCA.2011.5749711","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749711","url":null,"abstract":"Client + cloud computing is a disruptive, new computing platform, combining diverse client devices — PCs, smartphones, sensors, and single-function and embedded devices — with the unlimited, on-demand computation and data storage offered by cloud computing services such as Amazon's AWS or Microsoft's Windows Azure. As with every advance in computing, programming is a fundamental challenge as client + cloud computing combines many difficult aspects of software development. Systems built for this world are inherently parallel and distributed, run on unreliable hardware, and must be continually available — a challenging programming model for even the most skilled programmers. How then do ordinary programmers develop software for the Cloud?","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129936027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Achieving uniform performance and maximizing throughput in the presence of heterogeneity","authors":"K. Rangan, Michael D. Powell, Gu-Yeon Wei, D. Brooks","doi":"10.1109/HPCA.2011.5749712","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749712","url":null,"abstract":"Continued scaling of process technologies is critical to sustaining improvements in processor frequencies and performance. However, shrinking process technologies exacerbate process variations — the deviation of process parameters from their target specifications. In the context of multi-core CMPs, which are designed with homogeneous cores, within-die process variations result in substantially different core frequencies. Exposing such process-variation-induced heterogeneity conflicts with the norm of marketing chips at a single frequency. Further, application performance is undesirably dictated by the frequency of the core it happens to run on. To work around these challenges, a single uniform frequency, dictated by the slowest core, is currently chosen as the chip frequency, sacrificing the increased performance capabilities of cores that could operate at higher frequencies. In this paper, we propose choosing the mean frequency across all cores, in lieu of the minimum frequency, as the single chip sales frequency. We examine several scheduling algorithms, implemented below the OS in hardware/firmware, that guarantee minimum application performance near that of the average frequency, masking process-variation-induced heterogeneity from the end user. We show that our Throughput-Driven Fairness (TDF) scheduling policy improves throughput by an average of 12% compared to a naive fairness scheme (round-robin) for frequency-sensitive applications. At the same time, TDF allows 98% of chips to maintain minimum performance at or above 90% of that expected at the mean frequency, presenting a single uniform performance level for the chip.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126365352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beyond block I/O: Rethinking traditional storage primitives","authors":"Xiangyong Ouyang, D. Nellans, Robert Wipfel, David Flynn, D. Panda","doi":"10.1109/HPCA.2011.5749738","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749738","url":null,"abstract":"Over the last twenty years the interfaces for accessing persistent storage within a computer system have remained essentially unchanged. Simply put, seek, read, and write have defined the fundamental operations that can be performed against storage devices. These three interfaces have endured because the devices within storage subsystems have not fundamentally changed since the invention of magnetic disks. Non-volatile (flash) memory (NVM) has recently become a viable enterprise-grade storage medium. Initial implementations of NVM storage devices have chosen to export these same disk-based seek/read/write interfaces because they provide compatibility for legacy applications. We propose a new class of higher-order storage primitives, beyond simple block I/O, that high-performance solid-state storage should support. One such primitive, atomic-write, batches multiple I/O operations into a single logical group that will be persisted as a whole or rolled back upon failure. By moving write atomicity down the stack into the storage device, it is possible to significantly reduce the amount of work required at the application, filesystem, or operating system layers to guarantee the consistency and integrity of data. In this work we provide a proof-of-concept implementation of atomic-write on a modern solid-state device that leverages the underlying log-based flash translation layer (FTL). We present an example of how database management systems can benefit from atomic-write by modifying the MySQL InnoDB transactional storage engine. Using this new atomic-write primitive, we are able to increase system throughput by 33%, improve the 90th-percentile transaction response time by 20%, and reduce the volume of data written from MySQL to the storage subsystem by as much as 43% on industry-standard benchmarks, while maintaining ACID transaction semantics.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126699204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}