{"title":"Towards Workload-Aware Page Cache Replacement Policies for Hybrid Memories","authors":"Ahsen J. Uppal, Mitesh R. Meswani","doi":"10.1145/2818950.2818978","DOIUrl":"https://doi.org/10.1145/2818950.2818978","url":null,"abstract":"Die-stacked DRAM is an emerging technology that is expected to be integrated in future systems with off-package memories resulting in a hybrid memory system. A large body of recent research has investigated the use of die-stacked dynamic random-access memory (DRAM) as a hardware-manged last-level cache. This approach comes at the costs of managing large tag arrays, increased hit latencies, and potentially significant increases in hardware verification costs. An alternative approach is for the operating system (OS) to manage the die-stacked DRAM as a page cache for off-package memories. However, recent work in OS-managed page cache focuses on FIFO replacement and related variants as the baseline management policy. In this paper, we take a step back and investigate classical OS page replacement policies and re-evaluate them for hybrid memories. We find that when we use different die-stacked DRAM sizes, the choice of best management policy depends on cache size and application, and can result in as much as a 13X performance difference. Furthermore, within a single application run, the choice of best policy varies over time. We also evaluate co-scheduled workload pairs and find that the best policy varies by workload pair and cache configuration, and that the best-performing policy is typically the most fair. Our research motivates us to continue our investigation for developing workload-aware and cache configuration-aware page cache management policies.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"35 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131490725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MEMST: Cloning Memory Behavior using Stochastic Traces","authors":"Ganesh Balakrishnan, Yan Solihin","doi":"10.1145/2818950.2818971","DOIUrl":"https://doi.org/10.1145/2818950.2818971","url":null,"abstract":"Memory Controller and DRAM architecture are critical aspects of Chip Multi Processor (CMP) design. A good design needs an in-depth understanding of end-user workloads. However, designers rarely get insights into end-user workloads because of the proprietary nature of source code or data. Workload cloning is an emerging approach that can bridge this gap by creating a proxy for the proprietary workload (clone). Cloning involves profiling workloads to glean key statistics and then generating a clone offline for use in the design environment. However, there are no existing cloning techniques for accurately capturing memory controller and DRAM behavior that can be used by designers for a wide design space exploration. We propose Memory EMulation using Stochastic Traces, MEMST, a highly accurate black box cloning framework for capturing DRAM and MC behavior. We provide a detailed analysis of statistics that are necessary to model a workload accurately. We will also show how a clone can be generated from these statistics using a novel stochastic method. Finally, we will validate our framework across a wide design space by varying DRAM organization, address mapping, DRAM frequency, page policy, scheduling policy, input bus bandwidth, chipset latency, DRAM die revision, DRAM generation and DRAM refresh policy. We evaluated MEMST using CPU2006, BioBench, Stream and PARSEC benchmark suites across the design space for single-core, dual-core, quad-core and octa-core CMPs. We measured both performance and power metrics for the original workload and clones. The clones show a very high degree of correlation with the original workload for over 7900 data points with an average error of 1.8% and 1.6% for transaction latency and DRAM power respectively.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114149962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking Design Metrics for Datacenter DRAM","authors":"M. Awasthi","doi":"10.1145/2818950.2818973","DOIUrl":"https://doi.org/10.1145/2818950.2818973","url":null,"abstract":"Over the years, the evolution of DRAM has provided a little improvement in access latencies, but has been optimized to deliver greater peak bandwidths from the devices. The combined bandwidth in a contemporary multi-socket server system runs into hundreds of GB/s. However datacenter scale applications running on server platforms care largely about having access to a large pool of low-latency main memory (DRAM), and in the best case, are unable to utilize even a small fraction of the total memory bandwidth. In this extended abstract, we use measured data from the state-of-the-art servers running memory intensive datacenter workloads like Memcached to argue for main memory design to steer away from optimizing traditional metrics for DRAM design like peak bandwidth so as to be able to cater the growing needs to the datacenter server industry for high density, low latency memory with moderate bandwidth requirements.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127906438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"S-L1: A Software-based GPU L1 Cache that Outperforms the Hardware L1 for Data Processing Applications","authors":"Reza Mokhtari, M. Stumm","doi":"10.1145/2818950.2818969","DOIUrl":"https://doi.org/10.1145/2818950.2818969","url":null,"abstract":"Implementing a GPU L1 data cache entirely in software to usurp the hardware L1 cache sounds counter-intuitive. However, we show how a software L1 cache can perform significantly better than the hardware L1 cache for data-intensive streaming (i.e., \"Big-Data\") GPGPU applications. Hardware L1 data caches can perform poorly on current GPUs, because the size of the L1 is far too small and its cache line size is too large given the number of threads that typically need to run in parallel. Our paper makes two contributions. First, we experimentally characterize the performance behavior of modern GPU memory hierarchies and in doing so identify a number of bottlenecks. Secondly, we describe the design and implementation of a software L1 cache, S-L1. On ten streaming GPGPU applications, S-L1 performs 1.9 times faster, on average, when compared to using the default hardware L1, and 2.1 times faster, on average, when compared to using no L1 cache.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125789018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implications of Memory Interference for Composed HPC Applications","authors":"Brian Kocoloski, Yuyu Zhou, B. Childers, J. Lange","doi":"10.1145/2818950.2818965","DOIUrl":"https://doi.org/10.1145/2818950.2818965","url":null,"abstract":"The cost of inter-node I/O and data movement is becoming increasingly prohibitive for large scale High Performance Computing (HPC) applications. This trend is leading to the emergence of composed in situ applications that co-locate multiple components on the same node. However, these components may contend for underlying memory system resources. In this extended research abstract, we present a preliminary evaluation of the impacts of contention for shared resources in the memory hierarchy, including the last level cache (LLC) and DRAM bandwidth. We show that even modest levels of memory contention can have substantial performance implications for some benchmarks, and argue for a cross layer approach to resource partitioning and scheduling on future HPC systems.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125933119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instruction Offloading with HMC 2.0 Standard: A Case Study for Graph Traversals","authors":"Lifeng Nai, Hyesoon Kim","doi":"10.1145/2818950.2818982","DOIUrl":"https://doi.org/10.1145/2818950.2818982","url":null,"abstract":"Processing in Memory (PIM) was first proposed decades ago for reducing the overhead of data movement between core and memory. With the advances in 3D-stacking technologies, recently PIM architectures have regained researchers' attentions. Several fully-programmable PIM architectures as well as programming models were proposed in previous literature. Meanwhile, memory industry also starts to integrate computation units into Hybrid Memory Cube (HMC). In HMC 2.0 specification, a number of atomic instructions are supported. Although the instruction support is limited, it enables us to offload computations at instruction granularity. In this paper, we present a preliminary study of instruction offloading on HMC 2.0 using graph traversals as an example. By demonstrating the programmability and performance benefits, we show the feasibility of an instruction-level offloading PIM architecture.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134380293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Near memory data structure rearrangement","authors":"M. Gokhale, G. S. Lloyd, C. Hajas","doi":"10.1145/2818950.2818986","DOIUrl":"https://doi.org/10.1145/2818950.2818986","url":null,"abstract":"As CPU core counts continue to increase, the gap between compute power and available memory bandwidth has widened. A larger and deeper cache hierarchy benefits locality-friendly computation, but offers limited improvement to irregular, data intensive applications. In this work we explore a novel approach to accelerating these applications through in-memory data restructuring. Unlike other proposed processing-in-memory architectures, the rearrangement hardware performs data reduction, not compute offload. Using a custom FPGA emulator, we quantitatively evaluate performance and energy benefits of near-memory hardware structures that dynamically restructure in-memory data to cache-friendly layout, minimizing wasted memory bandwidth. Our results on representative irregular benchmarks using the Micron Hybrid Memory Cube memory model show speedup, bandwidth savings, and energy reduction. We present an API for the near-memory accelerator and describe the interaction between the CPU and the rearrangement hardware with application examples. The merits of an SRAM vs. a DRAM scratchpad buffer for rearranged data are explored.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134525205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Memory Pressure Aware Ballooning","authors":"Jinchun Kim, Viacheslav V. Fedorov, Paul V. Gratz, A. Reddy","doi":"10.1145/2818950.2818967","DOIUrl":"https://doi.org/10.1145/2818950.2818967","url":null,"abstract":"Hardware virtualization is a major component of large scale server and data center deployments due to their facilitation of server consolidation and scalability. Virtualization, however, comes at a high cost in terms of system main memory utilization. Current virtual machine (VM) memory management solutions impose a high performance penalty and are oblivious to the operating regime of the system. Therefore, there is a great need for low-impact VM memory management techniques which are aware of and reactive to current system state, to drive down the overheads of virtualization. We observe that the host machine operates under different memory pressure regimes, as the memory demand from guest VMs changes dynamically at runtime. Adapting to this runtime system state is critical to reduce the performance cost of VM memory management. In this paper, we propose a novel dynamic memory management policy called Memory Pressure Aware (MPA) ballooning. MPA ballooning dynamically allocates memory resources to each VM based on the current memory pressure regime. Moreover, MPA ballooning proactively reacts and adapts to sudden changes in memory demand from guest VMs. MPA ballooning requires neither additional hardware support, nor incurs extra minor page faults in its memory pressure estimation. We show that MPA ballooning provides an 13.2% geomean speed-up versus the current ballooning techniques across a set of application mixes running in guest VMs; often yielding performance nearly identical to that of a non-memory constrained system.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"287 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114549057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Software Techniques for Scratchpad Memory Management","authors":"Paul Sebexen, Thomas Sohmers","doi":"10.1145/2818950.2818966","DOIUrl":"https://doi.org/10.1145/2818950.2818966","url":null,"abstract":"Scratchpad memory is commonly encountered in embedded systems as an alternative or supplement to caches [3], however, cache-containing architectures continue to be preferred in many applications due to their general ease of programmability. A language-agnostic software management system is envisioned that improves portability to scratchpad architectures and significantly lowers power consumption of ported applications. We review a selection of existing techniques, discuss their applicability to various memory systems, and identify opportunities for applying new methods and optimizations to improve memory management on relevant architectures.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130978381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Architecture Exploration for Data Intensive Applications","authors":"Fernando Martin del Campo, P. Chow","doi":"10.1145/2818950.2818970","DOIUrl":"https://doi.org/10.1145/2818950.2818970","url":null,"abstract":"This paper presents Compass, a hardware/software simulator for data-intensive applications. Currently focusing on in-memory stores, the objective of the simulator is to explore diverse algorithms and hardware architectures, serving as an aid to design systems for applications in which the elevated rate of data transfers dictates their behaviour. Instead of simulating the devices of a conventional computing system, in Compass the modules represent the stages of the procedure to attend a request to store, retrieve, or delete information in a particular memory architecture, giving the simulator the flexibility to test and analyze several different algorithms, components, and ideas. The system maintains a cycle-accurate model that makes it easy to interface it with simulators of physical devices such as RAM memories. Under a scheme like this one, the simulator of a physical memory in the system anchors the timing to a realistic scenario, but the rest of the components can be easily modified to explore alternative approaches.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124848696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}