Proceedings of the workshop on Memory Systems Performance and Correctness: Latest Publications

O-structures: semantics for versioned memory

Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date : 2014-06-13 DOI: 10.1145/2618128.2618130

Eran Gilad, E. W. Mackay, M. Oskin, Yoav Etsion

Citations: 2

Trash in cache: detecting eternally silent stores

Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date : 2014-06-13 DOI: 10.1145/2618128.2618133

Jonathan A. Shidal, Zach Gottlieb, R. Cytron, K. Kavi

Abstract: The gap between processing and storage speeds remains a concern for computer system designers and application developers. This disparity can be bridged in part by eliminating unnecessary stores, thereby reducing the amount of traffic that flows from the processor and first-level caches to the slower components of the storage subsystem. Reducing the "write" traffic can improve program performance, save power, and increase the longevity of storage components that have limited write endurance. Techniques have been proposed and evaluated for identifying various classes of stores that can be silenced. A relatively unexplored class of such stores are those that would write data that is dirty, but dead. Such data appears as if it needs to be written back to memory from cache, yet it can be proven that the application can never subsequently access the data. In this paper, we suggest identifying garbage (trash) in cache, so that the dirty bytes associated with the trash need not be written to memory. We propose and evaluate a simple technique based on reference counting that finds a subset of these "eternally silent" (dead) stores. When applied to popular benchmarks, our results show that a significant fraction of the writes to memory can be silenced based on the impossibility of an application subsequently accessing the data.

Citations: 4

Main memory and cache performance of Intel Sandy Bridge and AMD Bulldozer

Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date : 2014-06-13 DOI: 10.1145/2618128.2618129

Daniel Molka, D. Hackenberg, R. Schöne

Abstract: Application performance on multicore processors is seldom constrained by the speed of floating point or integer units. Much more often, limitations are caused by the memory subsystem, particularly shared resources such as last level caches or memory controllers. Measuring, predicting and modeling memory performance becomes a steeper challenge with each new processor generation due to the growing complexity and core count. We tackle the important aspect of measuring and understanding undocumented memory performance numbers in order to create valuable insight into microprocessor details. For this, we build upon a set of sophisticated benchmarks that support latency and bandwidth measurements to arbitrary locations in the memory subsystem. These benchmarks are extended to support AVX instructions for bandwidth measurements and to integrate the coherence states (O)wned and (F)orward. We then use these benchmarks to perform an in-depth analysis of current ccNUMA multiprocessor systems with Intel (Sandy Bridge-EP) and AMD (Bulldozer) processors. Using our benchmarks we present fundamental memory performance data and illustrate performance-relevant architectural properties of both designs.

Citations: 58

A study of connected object locality in NUMA heaps

Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date : 2014-06-13 DOI: 10.1145/2618128.2618132

Khaled Alnowaiser

Abstract: Reference locality is vital to the performance of parallel Garbage Collection (GC) running on Non-Uniform Memory Access (NUMA) machines. A GC thread may trace remotely placed objects that descend from the root set or, for load balance, a GC thread may steal non-local objects from other threads' work lists. Processing distant live objects could introduce contention in the interconnect links between memory nodes and it could increase memory access latency. Researchers have proposed various techniques to improve GC tracing locality. However, few studies attempt to optimize the locality of connected objects in NUMA object graph. In this paper, we study the locality of a rooted subgraph, a unit of object connectivity in the object graph. A rooted subgraph is a set of references containing one root reference, and every reference is reachable from the root. We empirically study the locality of rooted subgraphs of DaCapo and SPECjbb2005 benchmark suites. The results show that more than 80% of objects in a rooted subgraph are located in the same memory node as the root object. We then propose a GC locality optimization that uses the root memory node as a heuristic to guide GC threads processing local objects.

Citations: 4

Proceedings of the workshop on Memory Systems Performance and Correctness

Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date : 2014-06-13 DOI: 10.1145/2618128

Jeremy Singer, Milind Kulkarni, T. Harris

Citations: 1

Feedback directed optimization of TCMalloc

Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date : 2014-06-13 DOI: 10.1145/2618128.2618131

Sangho Lee, Teresa L. Johnson, Easwaran Raman

Abstract: TCMalloc [9] is an open-source memory allocator. Its use of thread-local caches of free objects enables most allocations/deallocations to be satisfied from thread-local heaps not requiring locks, making it a highly scalable memory allocator for multi-threaded applications. TCMalloc code contains several parameters that control the thread-local caches. The values of these parameters have been carefully chosen to provide good performance for the common case. However, as we will show, the optimal values of these parameters depend upon application-specific memory allocation behavior, so there is no one configuration that attains the optimal performance in all applications. In light of this, this paper presents a feedback-directed optimization of TCMalloc. The proposed optimization method targets the batch sizes, which determine the aggressiveness and timing of thread cache management mechanisms that move free objects between central and thread-local caches. It aims to tailor the batch sizes to application behavior, in order to make prefetching from the central cache aggressive enough to reduce unnecessary synchronization, without causing other performance problems due to excessive garbage collection of free objects in the thread caches. To this end, the optimization method observes a target application during a profile run and uses an iterative algorithm to compute batch sizes. Empirical results show that the proposed optimization results in up to 10% performance improvement over the default configuration on Google internal benchmark applications.

Citations: 25

Nonvolatile memory is a broken time machine

Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date : 2014-06-13 DOI: 10.1145/2618128.2618136

Benjamin Ransford, Brandon Lucia

Citations: 85

Affinity-based hash tables

Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date : 2014-06-13 DOI: 10.1145/2618128.2618135

Brian Gernhardt, Rahman Lavaee, C. Ding

Citations: 0

Outlawing ghosts: avoiding out-of-thin-air results

Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date : 2014-06-13 DOI: 10.1145/2618128.2618134

H. Boehm, Brian Demsky

Citations: 95