{"title":"Adaptive Set-Granular Cooperative Caching","authors":"D. Rolán, B. Fraguela, R. Doallo","doi":"10.1109/HPCA.2012.6169028","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169028","url":null,"abstract":"Current Chip Multiprocessors (CMPs) consist of several cores, cache memories and interconnection networks in the same chip. Private last level cache (LLC) configurations assign a static portion of the LLC to each core. This provides lower latency and isolation, at the cost of depriving the system of the possibility of reassigning underutilized resources. A way of taking advantage of underutilized resources in other private LLCs in the same chip is to use the coherence mechanism to determine the state of those caches and spill lines to them. Also, it is well known that memory references are not uniformly distributed across the sets of a set-associative cache. Therefore, applying a uniform spilling policy to all the sets in a cache may not be the best option. This paper proposes Adaptive Set-Granular Cooperative Caching (ASCC), which measures the degree of stress of each set and performs spills between spiller and potential receiver sets, while it tackles capacity problems. Also, it adds a neutral state to prevent sets from being either spillers or receivers when it could be harmful. Furthermore, we propose Adaptive Variable-Granularity Cooperative Caching (AVGCC), which dynamically adjusts the granularity for applying these policies. Both techniques have a negligible storage overhead and can adapt to many-core environments using scalable structures. AVGCC improved average performance by 7.8% and reduced average memory latency by 27% relative to a traditional private LLC configuration in a 4-core CMP. Finally, we propose an extension of AVGCC to provide Quality of Service that increases the average performance gain to 8.1%.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"196 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123241290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WEST: Cloning data cache behavior using Stochastic Traces","authors":"Ganesh Balakrishnan, Yan Solihin","doi":"10.1109/HPCA.2012.6169042","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169042","url":null,"abstract":"Cache designers need an in-depth understanding of end user workloads, but certain end users are apprehensive about sharing code or traces due to the proprietary or confidential nature of code and data. To bridge this gap, cache designers use a reduced representation of the code (a clone). A promising cloning approach is the black box approach, where workloads are profiled to obtain key statistics, and a clone is automatically generated. Despite its potential, currently there are no highly accurate black box cloning methods for replicating data cache behavior. We propose Workload Emulation using Stochastic Traces (WEST), a highly accurate black box cloning technique for replicating data cache behavior of arbitrary programs. First, we analyze what profiling statistics are necessary and sufficient to capture a workload. Then, we generate a clone stochastically that produces statistics identical to the proprietary workload. WEST clones can be used in lieu of the workload for exploring cache sizes, associativities, write policies, replacement policies, cache hierarchies and co-scheduling, at a significantly reduced simulation time. We use a simple IPC model to control the rate of accesses to the cache hierarchy. We evaluated WEST using CPU2006 and BioBench suites over a wide cache design space for single core and dual core CMPs. The clones achieve an average error in miss ratio of only 0.4% across 1394 single core cache configurations. For co-scheduled mixes, WEST achieves an average error in miss ratio of only 3.1% for over 600 configurations.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130290104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MORSE: Multi-objective reconfigurable self-optimizing memory scheduler","authors":"Janani Mukundan, José F. Martínez","doi":"10.1109/HPCA.2012.6168945","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6168945","url":null,"abstract":"We propose a systematic and general approach to designing self-optimizing memory schedulers that can target arbitrary figures of merit (e.g., performance, throughput, energy, fairness). Using our framework, we instantiate three memory schedulers that target three important metrics: performance and energy efficiency of parallel workloads, as well as throughput/fairness of multiprogrammed workloads. Our experiments show that the resulting hardware significantly outperforms the state of the art in all cases.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"148 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115019901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MACAU: A Markov model for reliability evaluations of caches under Single-bit and Multi-bit Upsets","authors":"Jinho Suh, M. Annavaram, M. Dubois","doi":"10.1109/HPCA.2012.6168940","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6168940","url":null,"abstract":"Due to the growing trend that a Single Event Upset (SEU) can cause spatial Multi-Bit Upsets (MBUs), the effects of spatial MBUs have recently become an important yet very challenging issue, especially in large, last-level caches (LLCs) protected by protection codes. In the presence of spatial MBUs, the strength of the protection codes becomes a critical design issue. Developing a reliability model that includes the cumulative effects of overlapping SBUs, temporal MBUs and spatial MBUs is a very challenging problem, especially when protection codes are active. In this paper, we introduce a new framework called MACAU. MACAU is based on a Markov chain model and can compute the intrinsic MTTFs of scrubbed caches as well as benchmark caches protected by various codes. MACAU is the first framework that quantifies the failure rates of caches due to the combined effects of SBUs, temporal MBUs and spatial MBUs.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121182921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving write operations in MLC phase change memory","authors":"Lei Jiang, Bo Zhao, Youtao Zhang, Jun Yang, B. Childers","doi":"10.1109/HPCA.2012.6169027","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169027","url":null,"abstract":"Phase change memory (PCM) recently has emerged as a promising technology to meet the fast growing demand for large capacity memory in modern computer systems. In particular, multi-level cell (MLC) PCM that stores multiple bits in a single cell, offers high density with low per-byte fabrication cost. However, despite many advantages, such as good scalability and low leakage, PCM suffers from exceptionally slow write operations, which makes it challenging to integrate into the memory hierarchy. In this paper, we propose architectural innovations to improve the access time of MLC PCM. Due to cell process variation, composition fluctuation and the relatively small differences among resistance levels, MLC PCM typically employs an iterative write scheme to achieve precise control, which suffers from large write access latency. To address this issue, we propose write truncation (WT) to reduce the number of write iterations with the assistance of an extra error correction code (ECC). We also propose form switch (FS) to reduce the storage overhead of the ECC. By storing highly compressible lines in SLC form, FS improves read latency as well. Our experimental results show that WT and FS improve the effective write/read latency by 57%/28% respectively, and achieve 26% performance improvement over the state of the art.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122715858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quasi-nonvolatile SSD: Trading flash memory nonvolatility to improve storage system performance for enterprise applications","authors":"Yangyang Pan, Guiqiang Dong, Qi Wu, Tong Zhang","doi":"10.1109/HPCA.2012.6168954","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6168954","url":null,"abstract":"This paper advocates a quasi-nonvolatile solid-state drive (SSD) design strategy for enterprise applications. The basic idea is to trade data retention time of NAND flash memory for other system performance metrics including program/erase (P/E) cycling endurance and memory programming speed, and meanwhile use explicit internal data refresh to accommodate very short data retention time (e.g., a few weeks or even days). We also propose SSD scheduling schemes to minimize the impact of internal data refresh on normal I/O requests. Based upon detailed memory cell device modeling and SSD system modeling, we carried out simulations that clearly show the potential of using this simple quasi-nonvolatile SSD design strategy to improve system cycling endurance and speed performance. We also performed detailed energy consumption estimation, which shows the energy consumption overhead induced by data refresh is negligible.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122224661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture","authors":"Jaekyu Lee, Hyesoon Kim","doi":"10.1109/HPCA.2012.6168947","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6168947","url":null,"abstract":"Combining CPUs and GPUs on the same chip has become a popular architectural trend. However, these heterogeneous architectures put more pressure on shared resource management. In particular, managing the last-level cache (LLC) is very critical to performance. Lately, many researchers have proposed several shared cache management mechanisms, including dynamic cache partitioning and promotion-based cache management, but no cache management work has been done on CPU-GPU heterogeneous architectures. Sharing the LLC between CPUs and GPUs brings new challenges due to the different characteristics of CPU and GPGPU applications. Unlike most memory-intensive CPU benchmarks that hide memory latency with caching, many GPGPU applications hide memory latency by combining thread-level parallelism (TLP) and caching. In this paper, we propose a TLP-aware cache management policy for CPU-GPU heterogeneous architectures. We introduce a core-sampling mechanism to detect how caching affects the performance of a GPGPU application. Inspired by previous cache management schemes, Utility-based Cache Partitioning (UCP) and Re-Reference Interval Prediction (RRIP), we propose two new mechanisms: TAP-UCP and TAP-RRIP. TAP-UCP improves performance by 5% over UCP and 11% over LRU on 152 heterogeneous workloads, and TAP-RRIP improves performance by 9% over RRIP and 12% over LRU.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124885862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QuickIA: Exploring heterogeneous architectures on real prototypes","authors":"Nagabhushan Chitlur, G. Srinivasa, Scott Hahn, Pankaj Gupta, D. Reddy, David A. Koufaty, P. Brett, Abirami Prabhakaran, Li Zhao, Nelson Ijih, S. Subhaschandra, Sabina Grover, Xiaowei Jiang, R. Iyer","doi":"10.1109/HPCA.2012.6169046","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169046","url":null,"abstract":"Over the last decade, homogeneous multi-core processors emerged and became the de-facto approach for offering high parallelism, high performance and scalability for a wide range of platforms. We are now at an interesting juncture where several critical factors (smaller form factor devices, power challenges, need for specialization, etc.) are guiding architects to consider heterogeneous chips and platforms for the next decade and beyond. Exploring heterogeneous architectures is challenging since it involves re-evaluating architecture options, OS implications and application development. In this paper, we describe these research challenges and then introduce a heterogeneous prototype platform called QuickIA that enables rapid exploration of heterogeneous architectures employing multiple generations of Intel processors for evaluating the implications of asymmetry and FPGAs to experiment with specialized processors or accelerators. We also show example case studies using the QuickIA research prototype to highlight its value in conducting heterogeneous architecture, OS and applications research.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115641591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BulkSMT: Designing SMT processors for atomic-block execution","authors":"Xuehai Qian, B. Sahelices, J. Torrellas","doi":"10.1109/HPCA.2012.6168952","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6168952","url":null,"abstract":"Multiprocessor architectures that continuously execute atomic blocks (or chunks) of instructions can improve performance and software productivity. However, all of the prior proposals for such architectures assume single-context cores as building blocks - rather than the widely-used Simultaneous Multithreading (SMT) cores. As a result, they are underutilizing hardware resources. This paper presents the first SMT design that supports continuous chunked (or transactional) execution of its contexts. Our design, called BulkSMT, can be used either in a single-core processor or in a multicore of SMTs. We present a set of BulkSMT configurations with different cost and performance. We also describe the architectural primitives that enable chunked execution in an SMT core and in a multicore of SMTs. Our results, based on simulations of SPLASH-2 and PARSEC codes, show that BulkSMT supports chunked execution cost-effectively. In a 4-core multicore with eager chunked execution, BulkSMT reduces the execution time of the applications by an average of 26% compared to running on single-context cores. In a single core, the average reduction is 32%.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124148067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parabix: Boosting the efficiency of text processing on commodity processors","authors":"Dan Lin, Nigel Medforth, Kenneth S. Herdy, Arrvindh Shriraman, R. Cameron","doi":"10.1109/HPCA.2012.6169041","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169041","url":null,"abstract":"Modern applications employ text files widely for providing data storage in a readable format for applications ranging from database systems to mobile phones. Traditional text processing tools are built around a byte-at-a-time sequential processing model that introduces significant branch and cache miss penalties. Recent work has explored an alternative, transposed representation of text, Parabix (Parallel Bit Streams), to accelerate scanning and parsing using SIMD facilities. This paper advocates and develops Parabix as a general framework and toolkit, describing the software toolchain and run-time support that allows applications to exploit modern SIMD instructions for high performance text processing. The goal is to generalize the techniques to ensure that they apply across a wide variety of applications and architectures. The toolchain enables the application developer to write constructs assuming unbounded character streams and Parabix's code translator generates code based on machine specifics (e.g., SIMD register widths). The general argument in support of Parabix technology is made by a detailed performance and energy study of XML parsing across a range of processor architectures. Parabix exploits intra-core SIMD hardware and demonstrates 2×-7× speedup and 4× improvement in energy efficiency when compared with two widely used conventional software parsers, Expat and Apache-Xerces. SIMD implementations across three generations of x86 processors are studied including the new SandyBridge. The 256-bit AVX technology in Intel SandyBridge is compared with the well established 128-bit SSE technology to analyze the benefits and challenges of 3-operand instruction formats and wider SIMD hardware. Finally, the XML program is partitioned into pipeline stages to demonstrate that thread-level parallelism enables the application to exploit SIMD units scattered across the different cores, achieving improved performance (2× on 4 cores) while maintaining single-threaded energy levels.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133066131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}