Proceedings of the 2015 International Symposium on Memory Systems: Latest Publications

Towards Workload-Aware Page Cache Replacement Policies for Hybrid Memories
Proceedings of the 2015 International Symposium on Memory Systems | Pub Date: 2015-10-05 | DOI: 10.1145/2818950.2818978
Ahsen J. Uppal, Mitesh R. Meswani
Abstract: Die-stacked DRAM is an emerging technology that is expected to be integrated in future systems alongside off-package memories, resulting in a hybrid memory system. A large body of recent research has investigated the use of die-stacked dynamic random-access memory (DRAM) as a hardware-managed last-level cache. This approach comes at the cost of managing large tag arrays, increased hit latencies, and potentially significant increases in hardware verification costs. An alternative approach is for the operating system (OS) to manage the die-stacked DRAM as a page cache for off-package memories. However, recent work on OS-managed page caches focuses on FIFO replacement and related variants as the baseline management policy. In this paper, we take a step back, revisit classical OS page replacement policies, and re-evaluate them for hybrid memories. We find that across different die-stacked DRAM sizes, the best management policy depends on cache size and application, and the choice can make as much as a 13x difference in performance. Furthermore, within a single application run, the best policy varies over time. We also evaluate co-scheduled workload pairs and find that the best policy varies by workload pair and cache configuration, and that the best-performing policy is typically also the fairest. These results motivate our continuing work toward workload-aware and cache-configuration-aware page cache management policies.
Citations: 6
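The classical replacement policies the abstract revisits can be contrasted with a small simulation. This is an illustrative sketch, not the paper's evaluation framework: it counts hits for FIFO and LRU page caches on a synthetic trace with a hot page, showing how the choice of policy alone changes the hit count.

```python
from collections import OrderedDict, deque

def fifo_hits(trace, capacity):
    """Count hits for a FIFO page cache: evict the oldest-inserted page."""
    cache, order, hits = set(), deque(), 0
    for page in trace:
        if page in cache:
            hits += 1
        else:
            if len(cache) == capacity:
                cache.discard(order.popleft())
            cache.add(page)
            order.append(page)
    return hits

def lru_hits(trace, capacity):
    """Count hits for an LRU page cache: evict the least-recently-used page."""
    cache, hits = OrderedDict(), 0
    for page in trace:
        if page in cache:
            hits += 1
            cache.move_to_end(page)   # mark as most recently used
        else:
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict the LRU entry
            cache[page] = None
    return hits

# A looping trace with a hot page 0: LRU keeps the hot page resident,
# while FIFO repeatedly evicts it.
trace = [0, 1, 2, 0, 3, 0, 4, 0, 1, 2, 0, 3, 0, 4]
fifo, lru = fifo_hits(trace, 3), lru_hits(trace, 3)
```

On this trace, LRU scores more hits than FIFO, consistent with the paper's point that the baseline FIFO policy is not always the right choice.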
MEMST: Cloning Memory Behavior using Stochastic Traces
Pub Date: 2015-10-05 | DOI: 10.1145/2818950.2818971
Ganesh Balakrishnan, Yan Solihin
Abstract: Memory controller (MC) and DRAM architecture are critical aspects of chip multiprocessor (CMP) design. A good design needs an in-depth understanding of end-user workloads, yet designers rarely get insight into those workloads because the source code or data is proprietary. Workload cloning is an emerging approach that bridges this gap by creating a proxy (clone) for the proprietary workload. Cloning involves profiling workloads to glean key statistics and then generating a clone offline for use in the design environment. However, no existing cloning technique accurately captures memory controller and DRAM behavior in a form designers can use for wide design-space exploration. We propose Memory EMulation using Stochastic Traces (MEMST), a highly accurate black-box cloning framework for capturing DRAM and MC behavior. We provide a detailed analysis of the statistics necessary to model a workload accurately and show how a clone can be generated from these statistics using a novel stochastic method. Finally, we validate our framework across a wide design space by varying DRAM organization, address mapping, DRAM frequency, page policy, scheduling policy, input bus bandwidth, chipset latency, DRAM die revision, DRAM generation, and DRAM refresh policy. We evaluated MEMST using the CPU2006, BioBench, STREAM, and PARSEC benchmark suites for single-, dual-, quad-, and octa-core CMPs, measuring both performance and power for the original workloads and their clones. The clones show a very high degree of correlation with the original workloads over more than 7900 data points, with average errors of 1.8% and 1.6% for transaction latency and DRAM power, respectively.
Citations: 7
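The profile-then-generate flow described in the abstract can be sketched in miniature. This is a toy stand-in, not MEMST itself: it profiles only an address-stride histogram and a read/write mix (MEMST gleans far richer statistics), then samples a synthetic clone trace that reproduces those distributions.

```python
import random
from collections import Counter

def profile(trace):
    """Glean simple statistics from a memory trace: a stride histogram
    and the read/write mix. (A tiny stand-in for MEMST's statistics.)"""
    strides = Counter(b[0] - a[0] for a, b in zip(trace, trace[1:]))
    rw = Counter(op for _, op in trace)
    return strides, rw

def generate_clone(stats, start_addr, length, seed=42):
    """Stochastically generate a proxy trace whose strides and
    read/write mix follow the profiled distributions."""
    strides, rw = stats
    rng = random.Random(seed)
    stride_vals, stride_wts = list(strides), list(strides.values())
    op_vals, op_wts = list(rw), list(rw.values())
    addr, clone = start_addr, []
    for _ in range(length):
        clone.append((addr, rng.choices(op_vals, op_wts)[0]))
        addr += rng.choices(stride_vals, stride_wts)[0]
    return clone

# "Proprietary" original: sequential 64-byte accesses, 3 reads per write.
orig = [(i * 64, 'R' if i % 4 else 'W') for i in range(1000)]
clone = generate_clone(profile(orig), 0, 1000)
```

The clone can then be replayed in a memory-system simulator in place of the original workload; the paper's contribution is identifying which statistics make such a clone faithful across DRAM configurations.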
Rethinking Design Metrics for Datacenter DRAM
Pub Date: 2015-10-05 | DOI: 10.1145/2818950.2818973
M. Awasthi
Abstract: Over the years, DRAM has seen little improvement in access latency but has been optimized to deliver ever-greater peak bandwidth. The combined bandwidth of a contemporary multi-socket server runs into hundreds of GB/s. However, datacenter-scale applications running on server platforms care mostly about having access to a large pool of low-latency main memory (DRAM) and, even in the best case, are unable to utilize more than a small fraction of the total memory bandwidth. In this extended abstract, we use measured data from state-of-the-art servers running memory-intensive datacenter workloads such as Memcached to argue that main memory design should steer away from optimizing traditional DRAM metrics like peak bandwidth and instead cater to the datacenter server industry's growing need for high-density, low-latency memory with only moderate bandwidth.
Citations: 6
S-L1: A Software-based GPU L1 Cache that Outperforms the Hardware L1 for Data Processing Applications
Pub Date: 2015-10-05 | DOI: 10.1145/2818950.2818969
Reza Mokhtari, M. Stumm
Abstract: Implementing a GPU L1 data cache entirely in software to usurp the hardware L1 cache sounds counter-intuitive. However, we show how a software L1 cache can perform significantly better than the hardware L1 cache for data-intensive streaming (i.e., "Big Data") GPGPU applications. Hardware L1 data caches can perform poorly on current GPUs because the L1 is far too small and its cache line size is too large given the number of threads that typically need to run in parallel. Our paper makes two contributions. First, we experimentally characterize the performance behavior of modern GPU memory hierarchies and, in doing so, identify a number of bottlenecks. Second, we describe the design and implementation of a software L1 cache, S-L1. On ten streaming GPGPU applications, S-L1 performs 1.9 times faster, on average, than the default hardware L1, and 2.1 times faster, on average, than using no L1 cache.
Citations: 1
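A software-managed cache is, at bottom, tag checks and line fills done in ordinary code. The sketch below is a hypothetical illustration of that idea (S-L1 itself is implemented in GPU code with per-thread-block storage): a direct-mapped cache with deliberately small 4-word lines in front of a backing "global memory" array, counting hits and misses.

```python
# Illustrative word-granularity software cache, direct-mapped, with small
# (4-word) lines -- echoing S-L1's point that the hardware line size is
# too large for the number of threads running in parallel.
LINE_WORDS = 4
NUM_LINES = 8

class SoftwareCache:
    def __init__(self, memory):
        self.memory = memory             # backing "global memory"
        self.tags = [None] * NUM_LINES   # which line each slot holds
        self.lines = [[0] * LINE_WORDS for _ in range(NUM_LINES)]
        self.hits = self.misses = 0

    def load(self, addr):
        line_id = addr // LINE_WORDS
        slot = line_id % NUM_LINES       # direct-mapped placement
        if self.tags[slot] == line_id:
            self.hits += 1
        else:
            self.misses += 1             # fill from backing memory
            base = line_id * LINE_WORDS
            self.lines[slot] = self.memory[base:base + LINE_WORDS]
            self.tags[slot] = line_id
        return self.lines[slot][addr % LINE_WORDS]

mem = list(range(256))
cache = SoftwareCache(mem)
# 0..3 share one line (1 miss, 3 hits); 64 conflicts with line 0 and
# evicts it, so the final load of 0 misses again.
vals = [cache.load(a) for a in (0, 1, 2, 3, 0, 64, 0)]
```

The trade-off S-L1 exploits is that these software tag checks cost instructions, but the smaller, thread-friendly lines waste far less bandwidth on streaming access patterns.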
Implications of Memory Interference for Composed HPC Applications
Pub Date: 2015-10-05 | DOI: 10.1145/2818950.2818965
Brian Kocoloski, Yuyu Zhou, B. Childers, J. Lange
Abstract: The cost of inter-node I/O and data movement is becoming increasingly prohibitive for large-scale High Performance Computing (HPC) applications. This trend is leading to the emergence of composed in situ applications that co-locate multiple components on the same node. These components, however, may contend for underlying memory system resources. In this extended research abstract, we present a preliminary evaluation of the impact of contention for shared resources in the memory hierarchy, including the last-level cache (LLC) and DRAM bandwidth. We show that even modest levels of memory contention can have substantial performance implications for some benchmarks, and argue for a cross-layer approach to resource partitioning and scheduling on future HPC systems.
Citations: 6
Instruction Offloading with HMC 2.0 Standard: A Case Study for Graph Traversals
Pub Date: 2015-10-05 | DOI: 10.1145/2818950.2818982
Lifeng Nai, Hyesoon Kim
Abstract: Processing in Memory (PIM) was first proposed decades ago to reduce the overhead of data movement between core and memory. With advances in 3D-stacking technologies, PIM architectures have recently regained researchers' attention. Several fully programmable PIM architectures and programming models have been proposed in the literature, and the memory industry has also started to integrate computation units into the Hybrid Memory Cube (HMC): the HMC 2.0 specification supports a number of atomic instructions. Although this instruction support is limited, it enables computation to be offloaded at instruction granularity. In this paper, we present a preliminary study of instruction offloading on HMC 2.0 using graph traversal as an example. By demonstrating the programmability and performance benefits, we show the feasibility of an instruction-level offloading PIM architecture.
Citations: 32
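To see what "offloading at instruction granularity" means for a graph traversal, consider where the memory-bound work sits in a plain BFS. The sketch below is illustrative only: the per-neighbor "update the level if still unvisited" step is exactly the kind of short read-modify-write that an HMC 2.0-style atomic could, in principle, execute inside the memory cube instead of round-tripping the data to the core.

```python
from collections import deque

def bfs_levels(adj, root):
    """Level-synchronous BFS over an adjacency list.
    The marked update below is the offload candidate: a small
    read-modify-write on levels[v] with no other CPU-side work."""
    levels = [-1] * len(adj)   # -1 means unvisited
    levels[root] = 0
    frontier = deque([root])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            # Offload candidate: an atomic compare-and-update on
            # levels[v] could run near-memory, avoiding a cache-line
            # transfer per neighbor on this irregular access pattern.
            if levels[v] == -1:
                levels[v] = levels[u] + 1
                frontier.append(v)
    return levels

# Small undirected graph: 0-1, 0-2, 1-3, 2-3, 3-4.
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
levels = bfs_levels(adj, 0)
```

The random neighbor indexing is what makes graph traversal a natural case study: almost every `levels[v]` touch is a cache miss, so moving that one instruction's worth of work into the memory device saves most of the traffic.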
Near memory data structure rearrangement
Pub Date: 2015-10-05 | DOI: 10.1145/2818950.2818986
M. Gokhale, G. S. Lloyd, C. Hajas
Abstract: As CPU core counts continue to increase, the gap between compute power and available memory bandwidth has widened. A larger, deeper cache hierarchy benefits locality-friendly computation but offers limited improvement to irregular, data-intensive applications. In this work we explore a novel approach to accelerating these applications through in-memory data restructuring. Unlike other proposed processing-in-memory architectures, the rearrangement hardware performs data reduction, not compute offload. Using a custom FPGA emulator, we quantitatively evaluate the performance and energy benefits of near-memory hardware structures that dynamically restructure in-memory data into a cache-friendly layout, minimizing wasted memory bandwidth. Our results on representative irregular benchmarks using the Micron Hybrid Memory Cube memory model show speedup, bandwidth savings, and energy reduction. We present an API for the near-memory accelerator and describe the interaction between the CPU and the rearrangement hardware with application examples. We also explore the merits of an SRAM versus a DRAM scratchpad buffer for rearranged data.
Citations: 40
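The "data reduction, not compute offload" distinction can be made concrete with a gather example. This is a hypothetical sketch of the idea, not the paper's API: when the CPU needs one field from each of many scattered records, a near-memory engine can pull just those words into a dense buffer, so only the useful data crosses the memory bus instead of a full cache line per record.

```python
import random

RECORD_WORDS = 8   # each record occupies a full "cache line" of words

def near_memory_gather(memory, indices, field_offset):
    """Near-memory gather: extract one field from each indexed record
    into a dense buffer. The engine reduces data volume; the actual
    computation over the buffer stays on the CPU."""
    return [memory[i * RECORD_WORDS + field_offset] for i in indices]

# An array of 100 records laid out in flat memory, accessed irregularly.
memory = list(range(100 * RECORD_WORDS))
indices = random.Random(0).sample(range(100), 10)

# Without rearrangement the CPU would fetch 10 * RECORD_WORDS words
# (one line per record); the dense buffer carries only 10.
dense = near_memory_gather(memory, indices, field_offset=3)
total = sum(dense)   # CPU-side compute now streams a contiguous buffer
```

Whether the dense buffer lives in an SRAM or DRAM scratchpad, as the abstract notes, is then a separate capacity/energy trade-off.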
Dynamic Memory Pressure Aware Ballooning
Pub Date: 2015-10-05 | DOI: 10.1145/2818950.2818967
Jinchun Kim, Viacheslav V. Fedorov, Paul V. Gratz, A. Reddy
Abstract: Hardware virtualization is a major component of large-scale server and data center deployments because it facilitates server consolidation and scalability. Virtualization, however, comes at a high cost in system main memory utilization. Current virtual machine (VM) memory management solutions impose a high performance penalty and are oblivious to the system's operating regime. There is therefore a great need for low-impact VM memory management techniques that are aware of, and reactive to, current system state, to drive down the overheads of virtualization. We observe that the host machine operates under different memory pressure regimes as the memory demand from guest VMs changes dynamically at runtime, and that adapting to this runtime state is critical to reducing the performance cost of VM memory management. In this paper, we propose a novel dynamic memory management policy called Memory Pressure Aware (MPA) ballooning. MPA ballooning dynamically allocates memory resources to each VM based on the current memory pressure regime, and proactively reacts and adapts to sudden changes in memory demand from guest VMs. It requires neither additional hardware support nor extra minor page faults in its memory pressure estimation. We show that MPA ballooning provides a 13.2% geometric-mean speedup over current ballooning techniques across a set of application mixes running in guest VMs, often yielding performance nearly identical to that of a non-memory-constrained system.
Citations: 17
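The regime-driven policy structure can be sketched abstractly. This is a loose illustration under invented thresholds and reclaim fractions, not the MPA algorithm from the paper: the point is only that the balloon target per VM is a function of the host's current pressure regime rather than a fixed setting.

```python
def pressure_regime(free_ratio):
    """Classify host memory pressure from the free-memory ratio.
    Thresholds here are illustrative, not from the paper."""
    if free_ratio > 0.5:
        return 'low'
    if free_ratio > 0.2:
        return 'moderate'
    return 'heavy'

def balloon_targets(vm_demands, host_total):
    """Pick per-VM balloon sizes (memory reclaimed from each guest)
    according to the current pressure regime."""
    free_ratio = 1 - sum(vm_demands.values()) / host_total
    regime = pressure_regime(free_ratio)
    reclaim_frac = {'low': 0.0,        # plenty free: leave guests alone
                    'moderate': 0.1,   # trim gently
                    'heavy': 0.3}[regime]  # reclaim aggressively
    return regime, {vm: int(demand * reclaim_frac)
                    for vm, demand in vm_demands.items()}

# Two guests demanding 8 GB and 4 GB on a 32 GB host: low pressure,
# so no ballooning is triggered.
regime, targets = balloon_targets({'vm1': 8, 'vm2': 4}, host_total=32)
```

Re-evaluating this function as demands change is what makes the policy reactive; the paper's contribution is doing the estimation without extra hardware or additional minor page faults.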
Software Techniques for Scratchpad Memory Management
Pub Date: 2015-10-05 | DOI: 10.1145/2818950.2818966
Paul Sebexen, Thomas Sohmers
Abstract: Scratchpad memory is commonly encountered in embedded systems as an alternative or supplement to caches [3]; however, cache-based architectures continue to be preferred in many applications because of their general ease of programmability. We envision a language-agnostic software management system that improves portability to scratchpad architectures and significantly lowers the power consumption of ported applications. We review a selection of existing techniques, discuss their applicability to various memory systems, and identify opportunities for applying new methods and optimizations to improve memory management on relevant architectures.
Citations: 3
Architecture Exploration for Data Intensive Applications
Pub Date: 2015-10-05 | DOI: 10.1145/2818950.2818970
Fernando Martin del Campo, P. Chow
Abstract: This paper presents Compass, a hardware/software simulator for data-intensive applications. Currently focused on in-memory stores, the simulator's objective is to explore diverse algorithms and hardware architectures, serving as an aid for designing systems whose behaviour is dictated by a high rate of data transfers. Instead of simulating the devices of a conventional computing system, Compass modules represent the stages of servicing a request to store, retrieve, or delete information in a particular memory architecture, giving the simulator the flexibility to test and analyze many different algorithms, components, and ideas. The system maintains a cycle-accurate model that makes it easy to interface with simulators of physical devices such as RAM. Under this scheme, the simulator of a physical memory anchors the timing to a realistic scenario, while the rest of the components can be easily modified to explore alternative approaches.
Citations: 1