"Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers"
V. Petrucci, M. Laurenzano, J. Doherty, Yunqi Zhang, D. Mossé, Jason Mars, Lingjia Tang
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 246-258. DOI: https://doi.org/10.1109/HPCA.2015.7056037
Abstract: Heterogeneous multicore architectures have the potential to improve energy efficiency by integrating power-efficient wimpy cores with high-performing brawny cores. However, it is an open question how to deliver energy reduction while ensuring the quality of service (QoS) of latency-sensitive web services running on such heterogeneous multicores in warehouse-scale computers (WSCs). In this work, we first investigate the implications of heterogeneous multicores in WSCs and show that directly adopting heterogeneous multicores without re-designing the software stack to provide QoS management leads to significant QoS violations. We then present Octopus-Man, a novel QoS-aware task management solution that dynamically maps latency-sensitive tasks to the least power-hungry processing resources that are sufficient to meet the QoS requirements. Using carefully designed feedback-control mechanisms, Octopus-Man addresses critical challenges that emerge due to uncertainties in workload fluctuations and adaptation dynamics in a real system. Our evaluation using web-search and memcached running on a real-system Intel heterogeneous prototype demonstrates that Octopus-Man improves energy efficiency by up to 41% (CPU power) and up to 15% (system power) over an all-brawny WSC design while adhering to specified QoS targets.
"Priority-based cache allocation in throughput processors"
Dong Li, Minsoo Rhu, Daniel R. Johnson, Mike O'Connor, M. Erez, D. Burger, D. Fussell, Stephen W. Keckler
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 89-100. DOI: https://doi.org/10.1109/HPCA.2015.7056024
Abstract: GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can increase contention for various system resources, however, which may result in suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache contention and improve performance. Throttling approaches can, however, lead to under-utilizing thread contexts, on-chip interconnect, and off-chip memory bandwidth. This paper proposes to tightly couple the thread scheduling mechanism with the cache management algorithms such that GPU cache pollution is minimized while off-chip memory throughput is enhanced. We propose priority-based cache allocation (PCAL), which provides preferential cache capacity to a subset of high-priority threads while simultaneously allowing lower-priority threads to execute without contending for the cache. By tuning thread-level parallelism while optimizing both caching efficiency and other shared resource usage, PCAL builds upon previous thread throttling approaches, improving overall performance by 17% on average and by up to 51%.
{"title":"Scaling distributed cache hierarchies through computation and data co-scheduling","authors":"Nathan Beckmann, Po-An Tsai, Daniel Sánchez","doi":"10.1109/HPCA.2015.7056061","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056061","url":null,"abstract":"Cache hierarchies are increasingly non-uniform, so for systems to scale efficiently, data must be close to the threads that use it. Moreover, cache capacity is limited and contended among threads, introducing complex capacity/latency tradeoffs. Prior NUCA schemes have focused on managing data to reduce access latency, but have ignored thread placement; and applying prior NUMA thread placement schemes to NUCA is inefficient, as capacity, not bandwidth, is the main constraint. We present CDCS, a technique to jointly place threads and data in multicores with distributed shared caches. We develop novel monitoring hardware that enables fine-grained space allocation on large caches, and data movement support to allow frequent full-chip reconfigurations. On a 64-core system, CDCS outperforms an S-NUCA LLC by 46% on average (up to 76%) in weighted speedup and saves 36% of system energy. CDCS also outperforms state-of-the-art NUCA schemes under different thread scheduling policies.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"38 1","pages":"538-550"},"PeriodicalIF":0.0,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85258433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Quantifying sources of error in McPAT and potential impacts on architectural studies"
S. Xi, H. Jacobson, P. Bose, Gu-Yeon Wei, D. Brooks
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 577-589. DOI: https://doi.org/10.1109/HPCA.2015.7056064
Abstract: Architectural power modeling tools are widely used by the computer architecture community for rapid evaluations of high-level design choices and design space explorations. Currently, McPAT [31] is the de facto power model, but the literature does not yet contain a careful examination of its modeling accuracy. In addition, the issue of how greatly power modeling error can affect architectural-level studies has not been quantified before. In this work, we present the first rigorous assessment of McPAT's core power and area models with a detailed, validated power modeling toolchain used in current industrial practice. We find that McPAT's predictions can have significant error because some of the models are either incomplete, too high-level, or assume implementations of structures that differ from those of the core at hand. We demonstrate that large errors are possible when using McPAT's dynamic power estimates in the context of voltage noise and thermal hotspots, but for steady-state properties, accurately modeling leakage power is more important. Based on our analysis, we are able to provide guidelines for creating accurate McPAT models, even without access to detailed industrial power modeling tools. We conclude that in spite of its accuracy gaps, McPAT is still a very useful tool for many architectural studies, and its limitations can often be adequately addressed for a given research study of interest.
{"title":"Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies","authors":"Alberto Ros, Mahdad Davari, S. Kaxiras","doi":"10.1109/HPCA.2015.7056032","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056032","url":null,"abstract":"Hierarchical clustered cache designs are becoming an appealing alternative for multicores. Grouping cores and their caches in clusters reduces network congestion by localizing traffic among several hierarchical levels, potentially enabling much higher scalability. While such architectures can be formed recursively by replicating a base design pattern, keeping the whole hierarchy coherent requires more effort and consideration. The reason is that, in hierarchical coherence, even basic operations must be recursive. As a consequence, intermediate-level caches behave both as directories and as leaf caches. This leads to an explosion of states, protocol-races, and protocol complexity. While there have been previous efforts to extend directory-based coherence to hierarchical designs their increased complexity and verification cost is a serious impediment to their adoption. We aim to address these concerns by encapsulating all hierarchical complexity in a simple function: that of determining when a data block is shared entirely within a cluster (sub-tree of the hierarchy) and is private from the outside. This allows us to eliminate complex recursive operations that span the hierarchy and instead employ simple coherence mechanisms such as self-invalidation and write-through - now restricted to operate within the cluster where a data block is shared. We examine two inclusivity options and discuss the relation of our approach to the recently proposed Hierarchical-Race-Free (HRF) memory models. Finally, comparisons to a hierarchical directory-based MOESI, VIPS-M, and TokenCMP protocols show that, despite its simplicity our approach results in competitive performance and decreased network traffic.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"19 1","pages":"186-197"},"PeriodicalIF":0.0,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79610273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Alloy: Parallel-serial memory channel architecture for single-chip heterogeneous processor systems"
Hao Wang, ChangMin Park, Gyungsu Byun, Jung Ho Ahn, N. Kim
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 296-308. DOI: https://doi.org/10.1109/HPCA.2015.7056041
Abstract: A single-chip heterogeneous processor integrates both CPU and GPU on the same chip, demanding higher memory bandwidth. However, the current parallel interface (e.g., DDR3) can increase neither the number of (memory) channels nor the bit rate of the channels without paying high package and power costs. In contrast, the high-speed serial interface (HSI) can offer much higher bandwidth for the same number of pins and lower power consumption for the same bandwidth than the parallel interface. This allows us to integrate more channels under a pin and/or package power constraint, but at the cost of longer latency for memory accesses and higher static energy consumption, in particular for idle channels. In this paper, we first provide a deep understanding of recent HSIs, which exhibit very distinct characteristics from past serial interfaces in terms of bit rate, latency, energy per bit transfer, and static power consumption. To overcome the limitation of using only parallel or serial interfaces, we then propose a hybrid memory channel architecture, Alloy, consisting of low-latency parallel and high-bandwidth serial channels. Alloy is assisted by two proposed techniques: (i) a memory channel partitioning technique that adaptively maps physical (memory) pages of latency-sensitive (CPU) and bandwidth-consuming (GPU) applications to parallel and serial channels, respectively, and (ii) a power management technique that reduces the static energy consumption of idle serial channels. On average, Alloy provides 21% and 32% higher performance for CPU and GPU, respectively, while consuming total memory interface energy comparable to the baseline parallel channel architecture for diverse mixes of co-running CPU and GPU applications.
"CiDRA: A cache-inspired DRAM resilience architecture"
Y. Son, Sukhan Lee, Seongil O, Sanghyuk Kwon, N. Kim, Jung Ho Ahn
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 502-513. DOI: https://doi.org/10.1109/HPCA.2015.7056058
Abstract: Although aggressive technology scaling has allowed manufacturers to integrate gigabits of cells into a cost-sensitive main memory DRAM device, these cells have become more defect-prone. With increased cell failure rates, conventional solutions such as populating spare DRAM rows and relying on error-correcting codes (ECCs) have shown limited success due to high area overhead, the latency penalties of data coding, and interference between ECC within a device (in-DRAM ECC) and other ECC across devices (rank-level ECC). In this paper, we propose CiDRA, a cache-inspired DRAM resilience architecture, which substantially reduces the area and latency overheads of correcting bit errors on random locations due to these faulty cells. We put a small SRAM cache within a DRAM device to replace accesses to the addresses including the faulty cells with ones that correspond to the cache data array. This CiDRA cache is paired with a Bloom filter to minimize the energy overhead of accessing the cache tags for every DRAM access, and is also partitioned into small pieces, each associated with the I/O pads for better area efficiency. Both the cache and the DRAM banks are accessed in parallel, and since the banks are much slower, the cache and filter are not on the critical path for normal DRAM accesses and incur no latency overhead. We also enhance the traditional in-DRAM ECC with error position bits and the appropriate error detecting capability while preventing interference with the traditional rank-level ECC scheme. By combining this enhanced in-DRAM ECC with the cache and Bloom filter, CiDRA becomes more area efficient, because the in-DRAM ECC corrects most bit errors, which are sporadic, while the cache deals with the remaining, relatively few pathological cases.
{"title":"GPU voltage noise: Characterization and hierarchical smoothing of spatial and temporal voltage noise interference in GPU architectures","authors":"Jingwen Leng, Yazhou Zu, V. Reddi","doi":"10.1109/HPCA.2015.7056030","DOIUrl":"https://doi.org/10.1109/HPCA.2015.7056030","url":null,"abstract":"Energy efficiency is undoubtedly important for GPU architectures. Besides the traditionally explored energy-efficiency optimization techniques, exploiting the supply voltage guard-band remains a promising yet unexplored opportunity. Our hardware measurements show that up to 23% of the nominal supply voltage can be eliminated to improve CPU energy efficiency by as much as 25%. The key obstacle for exploiting this opportunity lies in understanding the characteristics and root causes of large voltage droops in GPU architectures and subsequently smoothing them away without severe performance penalties. The CPU's manycore nature complicates the voltage noise phenomenon, and its distinctive architecture features from the CPU necessitate a CPU-specific voltage noise analysis. In this paper, we make the following contributions. First, we provide a voltage noise categorization framework to identify, characterize, and understand voltage noise in the manycore CPU architecture. Second, we perform a microarchitecture-level voltage-droop root-cause analysis for the two major droop types we identify, namely the local first-order droop and the global second-order droop. Third, on the basis of our categorization and characterization, we propose a hierarchical voltage smoothing mechanism that mitigates each type of voltage droop. Our evaluation shows it can reduce up to 31% worst-case droop, which translates to 11.8% core-level and 7.8% processor-level energy reduction.","PeriodicalId":6593,"journal":{"name":"2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)","volume":"13 1","pages":"161-173"},"PeriodicalIF":0.0,"publicationDate":"2015-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79918894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories"
Mitesh R. Meswani, S. Blagodurov, D. Roberts, John Slice, Mike Ignatowski, G. Loh
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 126-136. DOI: https://doi.org/10.1109/HPCA.2015.7056027
Abstract: Die-stacked DRAM is a technology that will soon be integrated in high-performance systems. Recent studies have focused on hardware caching techniques to make use of the stacked memory, but these approaches require complex changes to the processor and also cannot leverage the stacked memory to increase the system's overall memory capacity. In this work, we explore the challenges of exposing the stacked DRAM as part of the system's physical address space. This non-uniform memory access (NUMA)-styled approach greatly simplifies the hardware and increases the physical memory capacity of the system, but pushes the burden of managing the heterogeneous memory architecture (HMA) to the software layers. We first explore simple (and somewhat impractical) schemes to manage the HMA, and then refine the mechanisms to address a variety of hardware and software implementation challenges. In the end, we present an HMA approach with low hardware and software impact that can dynamically tune itself to different application scenarios, achieving performance even better than the (impractical-to-implement) baseline approaches.
"Exploiting compressed block size as an indicator of future reuse"
Gennady Pekhimenko, Tyler Huberty, Rui Cai, O. Mutlu, Phillip B. Gibbons, M. Kozuch, T. Mowry
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 51-63. DOI: https://doi.org/10.1109/HPCA.2015.7056021
Abstract: We introduce a set of new Compression-Aware Management Policies (CAMP) for on-chip caches that employ data compression. Our management policies are based on two key ideas. First, we show that it is possible to build a more efficient management policy for compressed caches if the compressed block size is directly used in calculating the value (importance) of a block to the cache. This leads to Minimal-Value Eviction (MVE), a policy that evicts the cache blocks with the least value, based on both the size and the expected future reuse. Second, we show that, in some cases, compressed block size can be used as an efficient indicator of the future reuse of a cache block. We use this idea to build a new insertion policy called Size-based Insertion Policy (SIP) that dynamically prioritizes cache blocks using their compressed size as an indicator. We compare CAMP (and its global variant G-CAMP) to prior on-chip cache management policies (both size-oblivious and size-aware) and find that our mechanisms are more effective in using compressed block size as an extra dimension in cache management decisions. Our results show that the proposed management policies (i) decrease off-chip bandwidth consumption (by 8.7% in single-core), (ii) decrease memory subsystem energy consumption (by 7.2% in single-core) for memory-intensive workloads compared to the best prior mechanism, and (iii) improve performance (by 4.9%/9.0%/10.2% on average in single-/two-/four-core workload evaluations and by up to 20.1%). CAMP is effective for a variety of compression algorithms and different cache designs with local and global replacement strategies.