Proceedings of the workshop on Memory Systems Performance and Correctness: Latest Articles

O-structures: semantics for versioned memory
Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date: 2014-06-13 DOI: 10.1145/2618128.2618130
Eran Gilad, E. W. Mackay, M. Oskin, Yoav Etsion
Abstract: This paper introduces O-structures, a novel architectural memory element that can be used to facilitate parallelism in task-based execution models. Much like register renaming, each write to an O-structure creates a new version of program memory at that location. These versions can be accessed concurrently and out of program order. O-structures provide a set of semantics that match the needs of task-based execution models, specifically allowing tasks to synchronize on specific versions of memory as well as coordinate access when the necessary version is not known at compile time. In this work, we describe O-structures and provide their complete semantics. We also discuss how a task-based execution of basic manipulations on common data structures (arrays, lists, trees, etc.) operates. Results are presented that measure the exposed memory-level parallelism (MLP) in these operations. We find that for data structures that were previously difficult to parallelize, such as linked lists, binary trees and sparse-matrix codes, we see significant memory-level parallelism (50 to 100 operations per cycle) when using O-structures.
Citations: 2
Trash in cache: detecting eternally silent stores
Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date: 2014-06-13 DOI: 10.1145/2618128.2618133
Jonathan A. Shidal, Zach Gottlieb, R. Cytron, K. Kavi
Abstract: The gap between processing and storage speeds remains a concern for computer system designers and application developers. This disparity can be bridged in part by eliminating unnecessary stores, thereby reducing the amount of traffic that flows from the processor and first-level caches to the slower components of the storage subsystem. Reducing the "write" traffic can improve program performance, save power, and increase the longevity of storage components that have limited write endurance. Techniques have been proposed and evaluated for identifying various classes of stores that can be silenced. A relatively unexplored class of such stores is those that would write data that is dirty, but dead. Such data appears as if it needs to be written back to memory from cache, yet it can be proven that the application can never subsequently access the data. In this paper, we suggest identifying garbage (trash) in cache, so that the dirty bytes associated with the trash need not be written to memory. We propose and evaluate a simple technique based on reference counting that finds a subset of these "eternally silent" (dead) stores. When applied to popular benchmarks, our results show that a significant fraction of the writes to memory can be silenced based on the impossibility of an application subsequently accessing the data.
Citations: 4
Main memory and cache performance of Intel Sandy Bridge and AMD Bulldozer
Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date: 2014-06-13 DOI: 10.1145/2618128.2618129
Daniel Molka, D. Hackenberg, R. Schöne
Abstract: Application performance on multicore processors is seldom constrained by the speed of floating point or integer units. Much more often, limitations are caused by the memory subsystem, particularly shared resources such as last level caches or memory controllers. Measuring, predicting and modeling memory performance becomes a steeper challenge with each new processor generation due to the growing complexity and core count. We tackle the important aspect of measuring and understanding undocumented memory performance numbers in order to create valuable insight into microprocessor details. For this, we build upon a set of sophisticated benchmarks that support latency and bandwidth measurements to arbitrary locations in the memory subsystem. These benchmarks are extended to support AVX instructions for bandwidth measurements and to integrate the coherence states (O)wned and (F)orward. We then use these benchmarks to perform an in-depth analysis of current ccNUMA multiprocessor systems with Intel (Sandy Bridge-EP) and AMD (Bulldozer) processors. Using our benchmarks we present fundamental memory performance data and illustrate performance-relevant architectural properties of both designs.
Citations: 58
A study of connected object locality in NUMA heaps
Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date: 2014-06-13 DOI: 10.1145/2618128.2618132
Khaled Alnowaiser
Abstract: Reference locality is vital to the performance of parallel Garbage Collection (GC) running on Non-Uniform Memory Access (NUMA) machines. A GC thread may trace remotely placed objects that descend from the root set or, for load balance, a GC thread may steal non-local objects from other threads' work lists. Processing distant live objects could introduce contention in the interconnect links between memory nodes and it could increase memory access latency. Researchers have proposed various techniques to improve GC tracing locality. However, few studies attempt to optimize the locality of connected objects in the NUMA object graph. In this paper, we study the locality of a rooted subgraph, a unit of object connectivity in the object graph. A rooted subgraph is a set of references containing one root reference, and every reference is reachable from the root. We empirically study the locality of rooted subgraphs of the DaCapo and SPECjbb2005 benchmark suites. The results show that more than 80% of objects in a rooted subgraph are located in the same memory node as the root object. We then propose a GC locality optimization that uses the root memory node as a heuristic to guide GC threads toward processing local objects.
Citations: 4
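The locality measure described in the abstract above (the share of objects in a rooted subgraph that sit on the same memory node as the root) can be illustrated with a small sketch. This is not the paper's tooling: the function name `rooted_subgraph_locality`, the dictionary-based heap, and the choice to exclude the root itself from the count are all assumptions made for illustration.

```python
from collections import deque


def rooted_subgraph_locality(root, references, node_of):
    """Fraction of objects reachable from `root` that live on the same
    memory node as `root`. The root itself is excluded (an assumption)."""
    home = node_of[root]
    seen = {root}
    queue = deque([root])
    local = total = 0
    while queue:
        obj = queue.popleft()
        for child in references.get(obj, ()):
            if child in seen:
                continue  # shared or cyclic reference, already counted
            seen.add(child)
            queue.append(child)
            total += 1
            if node_of[child] == home:
                local += 1
    # A root with no descendants is treated as fully local.
    return local / total if total else 1.0


# Hypothetical heap: object -> referenced objects, and object -> NUMA node.
references = {"r": ["a", "b"], "a": ["c"], "b": ["c", "d"], "d": ["r"]}
node_of = {"r": 0, "a": 0, "b": 0, "c": 1, "d": 0}

locality = rooted_subgraph_locality("r", references, node_of)
print(f"{locality:.2f}")  # 3 of the 4 reachable objects share node 0
```

In this toy heap, three of the four objects reachable from `r` are on node 0, so the measure is 0.75. A collector following the heuristic in the abstract would hand such a subgraph to a GC thread running on the root's node.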
Proceedings of the workshop on Memory Systems Performance and Correctness
Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date: 2014-06-13 DOI: 10.1145/2618128
Jeremy Singer, Milind Kulkarni, T. Harris
Abstract: Memory continues to be a major bottleneck in almost all computing systems. It is becoming more so as more cores and other agents are sharing parts of the memory system, and as applications that run on the cores are becoming increasingly data intensive. Continuing the tradition of eight previous successful incarnations, MSPC 2014 provided a forum for discussing all aspects of memory performance and correctness on a variety of systems (multi-core, desktop, embedded, server/cloud, high-performance computing, sensor, etc.) and related software and hardware innovations at various levels of the technology stack.
Citations: 1
Feedback directed optimization of TCMalloc
Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date: 2014-06-13 DOI: 10.1145/2618128.2618131
Sangho Lee, Teresa L. Johnson, Easwaran Raman
Abstract: TCMalloc [9] is an open-source memory allocator. Its use of thread-local caches of free objects enables most allocations/deallocations to be satisfied from thread-local heaps not requiring locks, making it a highly scalable memory allocator for multi-threaded applications. TCMalloc code contains several parameters that control the thread-local caches. The values of these parameters have been carefully chosen to provide good performance for the common case. However, as we will show, the optimal values of these parameters depend upon application-specific memory allocation behavior, so there is no one configuration that attains the optimal performance in all applications. In light of this, this paper presents a feedback-directed optimization of TCMalloc. The proposed optimization method targets the batch sizes, which determine the aggressiveness and timing of thread cache management mechanisms that move free objects between central and thread-local caches. It aims to tailor the batch sizes to application behavior, in order to make prefetching from the central cache aggressive enough to reduce unnecessary synchronization, without causing other performance problems due to excessive garbage collection of free objects in the thread caches. To this end, the optimization method observes a target application during a profile run and uses an iterative algorithm to compute batch sizes. Empirical results show that the proposed optimization results in up to 10% performance improvement over the default configuration on Google internal benchmark applications.
Citations: 25
Nonvolatile memory is a broken time machine
Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date: 2014-06-13 DOI: 10.1145/2618128.2618136
Benjamin Ransford, Brandon Lucia
Abstract: Energy harvesting enables intermittently powered devices to compute without built-in power. But frequent power failures, combined with nonvolatile memory intended to protect computational state, introduce strange control flow that turns sequential code into unwieldy concurrent code: programs must grapple with their own state from previous interrupted runs. This paper describes the broken time machine problem for these devices and outlines potential solutions from the perspective of safe concurrent programming.
Citations: 85
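The effect described in the abstract above, where a program meets its own state from an interrupted run, can be mimicked with a toy simulation. This is only a sketch under assumptions: the dictionary `nv` stands in for nonvolatile memory, and `PowerFailure`, `record`, and the `fail_midway` flag are hypothetical names, not anything taken from the paper.

```python
class PowerFailure(Exception):
    """Simulated loss of power in the middle of a task."""


def record(nv, reading, fail_midway=False):
    """Sequential task that should keep nv['count'] == len(nv['readings'])."""
    nv["count"] += 1                 # step 1: persists immediately
    if fail_midway:
        raise PowerFailure           # power is lost before step 2 runs
    nv["readings"].append(reading)   # step 2


# `nv` plays the role of nonvolatile memory: it survives the "reboot".
nv = {"count": 0, "readings": []}

try:
    record(nv, 7, fail_midway=True)  # first attempt is cut short
except PowerFailure:
    pass                             # volatile state is gone, nv is not

record(nv, 7)                        # after reboot the task restarts

print(nv)
```

After the restart, `count` is 2 while only one reading is stored, so the invariant that plain sequential code would guarantee no longer holds. The rerun has effectively raced with its own earlier, interrupted execution.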
Affinity-based hash tables
Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date: 2014-06-13 DOI: 10.1145/2618128.2618135
Brian Gernhardt, Rahman Lavaee, C. Ding
Abstract: From a trace of data accesses, it is possible to calculate an affinity hierarchy that groups related data together. Combining this hierarchy with the extremely common hash table, there is an opportunity to both improve cache performance and enable novel applications. This paper describes both the construction of the affinity hierarchy and its application to hash tables.
Citations: 0
Outlawing ghosts: avoiding out-of-thin-air results
Proceedings of the workshop on Memory Systems Performance and Correctness Pub Date: 2014-06-13 DOI: 10.1145/2618128.2618134
H. Boehm, Brian Demsky
Abstract: It is very difficult to define a programming language memory model for shared variables that both allows programmers to take full advantage of weakly-ordered memory operations and still correctly disallows so-called "out-of-thin-air" results, i.e., results that can be justified only via reasoning that is in some sense circular. Real programming language implementations do not produce out-of-thin-air results. Architectural specifications successfully disallow them. Nonetheless, the difficulty of disallowing them in language specifications causes real, and serious, problems. In the absence of such a specification, essentially all precise reasoning about non-trivial programs becomes impractical. This remains a critical open problem in the specifications of Java, C, and C++, among others. We argue that there are plausible and relatively straightforward solutions, but their performance impact requires further study. In the long run, they are likely to require strengthening of some hardware guarantees, so that they translate properly to guarantees at the programming language source level.
Citations: 95