{"title":"Adaptive Set-Granular Cooperative Caching","authors":"D. Rolán, B. Fraguela, R. Doallo","doi":"10.1109/HPCA.2012.6169028","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169028","url":null,"abstract":"Current Chip Multiprocessors (CMPs) consist of several cores, cache memories and interconnection networks in the same chip. Private last level cache (LLC) configurations assign a static portion of the LLC to each core. This provides lower latency and isolation, at the cost of depriving the system of the possibility of reassigning underutilized resources. A way of taking advantage of underutilized resources in other private LLCs in the same chip is to use the coherence mechanism to determine the state of those caches and spill lines to them. Also, it is well known that memory references are not uniformly distributed across the sets of a set-associative cache. Therefore, applying a uniform spilling policy to all the sets in a cache may not be the best option. This paper proposes Adaptive Set-Granular Cooperative Caching (ASCC), which measures the degree of stress of each set and performs spills between spiller and potential receiver sets, while it tackles capacity problems. Also, it adds a neutral state to prevent sets from being either spillers or receivers when it could be harmful. Furthermore, we propose Adaptive Variable-Granularity Cooperative Caching (AVGCC), which dynamically adjusts the granularity for applying these policies. Both techniques have a negligible storage overhead and can adapt to many-core environments using scalable structures. AVGCC improved average performance by 7.8% and reduced average memory latency by 27% relative to a traditional private LLC configuration in a 4-core CMP. Finally, we propose an extension of AVGCC to provide Quality of Service that increases the average performance gain to 8.1%.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"196 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123241290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WEST: Cloning data cache behavior using Stochastic Traces","authors":"Ganesh Balakrishnan, Yan Solihin","doi":"10.1109/HPCA.2012.6169042","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169042","url":null,"abstract":"Cache designers need an in-depth understanding of end user workloads, but certain end users are apprehensive about sharing code or traces due to the proprietary or confidential nature of code and data. To bridge this gap, cache designers use a reduced representation of the code (a clone). A promising cloning approach is the black box approach, where workloads are profiled to obtain key statistics, and a clone is automatically generated. Despite its potential, currently there are no highly accurate black box cloning methods for replicating data cache behavior. We propose Workload Emulation using Stochastic Traces (WEST), a highly accurate black box cloning technique for replicating data cache behavior of arbitrary programs. First, we analyze what profiling statistics are necessary and sufficient to capture a workload. Then, we generate a clone stochastically that produces statistics identical to the proprietary workload. WEST clones can be used in lieu of the workload for exploring cache sizes, associativities, write policies, replacement policies, cache hierarchies and co-scheduling, at a significantly reduced simulation time. We use a simple IPC model to control the rate of accesses to the cache hierarchy. We evaluated WEST using CPU2006 and BioBench suites over a wide cache design space for single core and dual core CMPs. The clones achieve an average error in miss ratio of only 0.4% across 1394 single core cache configurations. For co-scheduled mixes, WEST achieves an average error in miss ratio of only 3.1% for over 600 configurations.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130290104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MORSE: Multi-objective reconfigurable self-optimizing memory scheduler","authors":"Janani Mukundan, José F. Martínez","doi":"10.1109/HPCA.2012.6168945","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6168945","url":null,"abstract":"We propose a systematic and general approach to designing self-optimizing memory schedulers that can target arbitrary figures of merit (e.g., performance, throughput, energy, fairness). Using our framework, we instantiate three memory schedulers that target three important metrics: performance and energy efficiency of parallel workloads, as well as throughput/fairness of multiprogrammed workloads. Our experiments show that the resulting hardware significantly outperforms the state of the art in all cases.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"148 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115019901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MACAU: A Markov model for reliability evaluations of caches under Single-bit and Multi-bit Upsets","authors":"Jinho Suh, M. Annavaram, M. Dubois","doi":"10.1109/HPCA.2012.6168940","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6168940","url":null,"abstract":"Due to the growing trend that a Single Event Upset (SEU) can cause spatial Multi-Bit Upsets (MBUs), the effects of spatial MBUs have recently become an important yet very challenging issue, especially in large, last-level caches (LLCs) protected by protection codes. In the presence of spatial MBUs, the strength of the protection codes becomes a critical design issue. Developing a reliability model that includes the cumulative effects of overlapping SBUs, temporal MBUs and spatial MBUs is a very challenging problem, especially when protection codes are active. In this paper, we introduce a new framework called MACAU. MACAU is based on a Markov chain model and can compute the intrinsic MTTFs of scrubbed caches as well as benchmark caches protected by various codes. MACAU is the first framework that quantifies the failure rates of caches due to the combined effects of SBUs, temporal MBUs and spatial MBUs.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121182921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving write operations in MLC phase change memory","authors":"Lei Jiang, Bo Zhao, Youtao Zhang, Jun Yang, B. Childers","doi":"10.1109/HPCA.2012.6169027","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169027","url":null,"abstract":"Phase change memory (PCM) recently has emerged as a promising technology to meet the fast growing demand for large capacity memory in modern computer systems. In particular, multi-level cell (MLC) PCM that stores multiple bits in a single cell, offers high density with low per-byte fabrication cost. However, despite many advantages, such as good scalability and low leakage, PCM suffers from exceptionally slow write operations, which makes it challenging to integrate into the memory hierarchy. In this paper, we propose architectural innovations to improve the access time of MLC PCM. Due to cell process variation, composition fluctuation and the relatively small differences among resistance levels, MLC PCM typically employs an iterative write scheme to achieve precise control, which suffers from large write access latency. To address this issue, we propose write truncation (WT) to reduce the number of write iterations with the assistance of an extra error correction code (ECC). We also propose form switch (FS) to reduce the storage overhead of the ECC. By storing highly compressible lines in SLC form, FS improves read latency as well. Our experimental results show that WT and FS improve the effective write/read latency by 57%/28% respectively, and achieve 26% performance improvement over the state of the art.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122715858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quasi-nonvolatile SSD: Trading flash memory nonvolatility to improve storage system performance for enterprise applications","authors":"Yangyang Pan, Guiqiang Dong, Qi Wu, Tong Zhang","doi":"10.1109/HPCA.2012.6168954","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6168954","url":null,"abstract":"This paper advocates a quasi-nonvolatile solid-state drive (SSD) design strategy for enterprise applications. The basic idea is to trade data retention time of NAND flash memory for other system performance metrics including program/erase (P/E) cycling endurance and memory programming speed, and meanwhile use explicit internal data refresh to accommodate very short data retention time (e.g., a few weeks or even days). We also propose SSD scheduling schemes to minimize the impact of internal data refresh on normal I/O requests. Based upon detailed memory cell device modeling and SSD system modeling, we carried out simulations that clearly show the potential of using this simple quasi-nonvolatile SSD design strategy to improve system cycling endurance and speed performance. We also performed detailed energy consumption estimation, which shows the energy consumption overhead induced by data refresh is negligible.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122224661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture","authors":"Jaekyu Lee, Hyesoon Kim","doi":"10.1109/HPCA.2012.6168947","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6168947","url":null,"abstract":"Combining CPUs and GPUs on the same chip has become a popular architectural trend. However, these heterogeneous architectures put more pressure on shared resource management. In particular, managing the last-level cache (LLC) is very critical to performance. Lately, many researchers have proposed several shared cache management mechanisms, including dynamic cache partitioning and promotion-based cache management, but no cache management work has been done on CPU-GPU heterogeneous architectures. Sharing the LLC between CPUs and GPUs brings new challenges due to the different characteristics of CPU and GPGPU applications. Unlike most memory-intensive CPU benchmarks that hide memory latency with caching, many GPGPU applications hide memory latency by combining thread-level parallelism (TLP) and caching. In this paper, we propose a TLP-aware cache management policy for CPU-GPU heterogeneous architectures. We introduce a core-sampling mechanism to detect how caching affects the performance of a GPGPU application. Inspired by previous cache management schemes, Utility-based Cache Partitioning (UCP) and Re-Reference Interval Prediction (RRIP), we propose two new mechanisms: TAP-UCP and TAP-RRIP. TAP-UCP improves performance by 5% over UCP and 11% over LRU on 152 heterogeneous workloads, and TAP-RRIP improves performance by 9% over RRIP and 12% over LRU.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124885862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QuickIA: Exploring heterogeneous architectures on real prototypes","authors":"Nagabhushan Chitlur, G. Srinivasa, Scott Hahn, Pankaj Gupta, D. Reddy, David A. Koufaty, P. Brett, Abirami Prabhakaran, Li Zhao, Nelson Ijih, S. Subhaschandra, Sabina Grover, Xiaowei Jiang, R. Iyer","doi":"10.1109/HPCA.2012.6169046","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169046","url":null,"abstract":"Over the last decade, homogeneous multi-core processors emerged and became the de-facto approach for offering high parallelism, high performance and scalability for a wide range of platforms. We are now at an interesting juncture where several critical factors (smaller form factor devices, power challenges, need for specialization, etc.) are guiding architects to consider heterogeneous chips and platforms for the next decade and beyond. Exploring heterogeneous architectures is challenging since it involves re-evaluating architecture options, OS implications and application development. In this paper, we describe these research challenges and then introduce a heterogeneous prototype platform called QuickIA that enables rapid exploration of heterogeneous architectures employing multiple generations of Intel processors for evaluating the implications of asymmetry and FPGAs to experiment with specialized processors or accelerators. We also show example case studies using the QuickIA research prototype to highlight its value in conducting heterogeneous architecture, OS and applications research.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115641591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BulkSMT: Designing SMT processors for atomic-block execution","authors":"Xuehai Qian, B. Sahelices, J. Torrellas","doi":"10.1109/HPCA.2012.6168952","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6168952","url":null,"abstract":"Multiprocessor architectures that continuously execute atomic blocks (or chunks) of instructions can improve performance and software productivity. However, all of the prior proposals for such architectures assume single-context cores as building blocks - rather than the widely-used Simultaneous Multithreading (SMT) cores. As a result, they are underutilizing hardware resources. This paper presents the first SMT design that supports continuous chunked (or transactional) execution of its contexts. Our design, called BulkSMT, can be used either in a single-core processor or in a multicore of SMTs. We present a set of BulkSMT configurations with different cost and performance. We also describe the architectural primitives that enable chunked execution in an SMT core and in a multicore of SMTs. Our results, based on simulations of SPLASH-2 and PARSEC codes, show that BulkSMT supports chunked execution cost-effectively. In a 4-core multicore with eager chunked execution, BulkSMT reduces the execution time of the applications by an average of 26% compared to running on single-context cores. In a single core, the average reduction is 32%.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124148067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parabix: Boosting the efficiency of text processing on commodity processors","authors":"Dan Lin, Nigel Medforth, Kenneth S. Herdy, Arrvindh Shriraman, R. Cameron","doi":"10.1109/HPCA.2012.6169041","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169041","url":null,"abstract":"Modern applications employ text files widely for providing data storage in a readable format for applications ranging from database systems to mobile phones. Traditional text processing tools are built around a byte-at-a-time sequential processing model that introduces significant branch and cache miss penalties. Recent work has explored an alternative, transposed representation of text, Parabix (Parallel Bit Streams), to accelerate scanning and parsing using SIMD facilities. This paper advocates and develops Parabix as a general framework and toolkit, describing the software toolchain and run-time support that allows applications to exploit modern SIMD instructions for high performance text processing. The goal is to generalize the techniques to ensure that they apply across a wide variety of applications and architectures. The toolchain enables the application developer to write constructs assuming unbounded character streams and Parabix's code translator generates code based on machine specifics (e.g., SIMD register widths). The general argument in support of Parabix technology is made by a detailed performance and energy study of XML parsing across a range of processor architectures. Parabix exploits intra-core SIMD hardware and demonstrates 2×-7× speedup and 4× improvement in energy efficiency when compared with two widely used conventional software parsers, Expat and Apache-Xerces. SIMD implementations across three generations of x86 processors are studied including the new SandyBridge. The 256-bit AVX technology in Intel SandyBridge is compared with the well established 128-bit SSE technology to analyze the benefits and challenges of 3-operand instruction formats and wider SIMD hardware. Finally, the XML program is partitioned into pipeline stages to demonstrate that thread-level parallelism enables the application to exploit SIMD units scattered across the different cores, achieving improved performance (2× on 4 cores) while maintaining single-threaded energy levels.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133066131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}