{"title":"A low overhead method for recovering unused memory inside regions","authors":"Matthew Davis, P. Schachte, Z. Somogyi, H. Søndergaard","doi":"10.1145/2492408.2492415","DOIUrl":"https://doi.org/10.1145/2492408.2492415","url":null,"abstract":"Automating memory management improves both resource safety and programmer productivity. One approach, region-based memory management [9] (RBMM), applies compile-time reasoning to identify points in a program at which memory can be safely reclaimed. The main advantage of RBMM over traditional garbage collection (GC) is the avoidance of expensive runtime analysis, which makes reclaiming memory much faster. On the other hand, GC requires no static analysis, and, operating at runtime, can have significantly more accurate information about object lifetimes. In this paper we propose a hybrid system that seeks to combine the advantages of both methods while avoiding the overheads that previous hybrid systems incurred. Our system can also reclaim array segments whose elements are no longer reachable.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133399906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Program-centric cost models for locality","authors":"G. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, H. Simhadri","doi":"10.1145/2492408.2492417","DOIUrl":"https://doi.org/10.1145/2492408.2492417","url":null,"abstract":"In this position paper, we argue that cost models for locality in parallel machines should be program-centric, not machine-centric.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124291155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Can seqlocks get along with programming language memory models?","authors":"H. Boehm","doi":"10.1145/2247684.2247688","DOIUrl":"https://doi.org/10.1145/2247684.2247688","url":null,"abstract":"Seqlocks are an important synchronization mechanism and represent a significant improvement over conventional reader-writer locks in some contexts. They avoid the need to update a synchronization variable during a reader critical section, and hence improve performance by avoiding cache coherence misses on the lock object itself. Unfortunately, they rely on speculative racing loads inside the critical section. This makes them an interesting problem case for programming-language-level memory models that emphasize data-race-free programming. We analyze a variety of implementation alternatives within the C++11 memory model, and briefly address the corresponding issue in Java. In the process, we observe that there may be a use for \"read-dont-modify-write\" operations, i.e., read-modify-write operations that atomically write back the original value, without modifying it, solely for the memory model consequences, and that it may be useful for compilers to optimize such operations.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"603 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131966555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
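The mechanism Boehm analyzes can be sketched with C++11 atomics. This is an illustrative single-writer variant using the fence-based formulation (the names and the particular choice of memory orders are ours, not one of the paper's specific implementation alternatives): the writer makes the sequence counter odd while updating, and readers retry if the counter was odd or changed across their critical section.

```cpp
#include <atomic>
#include <cstdint>

// Minimal single-writer seqlock sketch (illustrative only).
struct SeqLock {
    std::atomic<uint32_t> seq{0};
    std::atomic<int> data1{0};   // protected data; atomic loads avoid
    std::atomic<int> data2{0};   // undefined behavior on the racing reads

    void write(int a, int b) {
        uint32_t s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);       // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        data1.store(a, std::memory_order_relaxed);
        data2.store(b, std::memory_order_relaxed);
        seq.store(s + 2, std::memory_order_release);       // even: write complete
    }

    bool try_read(int& a, int& b) {
        uint32_t s1 = seq.load(std::memory_order_acquire);
        if (s1 & 1) return false;                          // writer active
        a = data1.load(std::memory_order_relaxed);         // speculative racing loads
        b = data2.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        uint32_t s2 = seq.load(std::memory_order_relaxed);
        return s1 == s2;                                   // retry if a write intervened
    }
};
```

Note how the reader's critical section writes no shared location at all, which is the cache-coherence advantage the abstract describes; the paper's "read-dont-modify-write" discussion concerns giving the final `seq` re-check stronger ordering than a plain load provides.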
{"title":"Rank idle time prediction driven last-level cache writeback","authors":"Zhe Wang, S. Khan, Daniel A. Jiménez","doi":"10.1145/2247684.2247690","DOIUrl":"https://doi.org/10.1145/2247684.2247690","url":null,"abstract":"In modern DDRx memory systems, memory write requests can cause significant performance loss by increasing the memory access latency for subsequent read requests targeting the same device. In this paper, we propose a rank idle time prediction driven last-level cache writeback technique. This technique uses a rank idle time predictor to predict long phases of idle rank cycles. The scheduled dirty cache blocks from the last-level cache are written back during the predicted long rank idle periods. This technique allows write requests to be serviced at the points that minimize the delay they cause to subsequent read requests, significantly reducing write-induced interference.\n We evaluate our technique using a cycle-accurate full-system simulator and the SPEC CPU2006 benchmarks. The results show that the technique improves performance in an eight-core system with memory-intensive workloads on average by 10.5% and 10.1% over conventional writeback using two-rank and four-rank DRAM configurations, respectively.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123734654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards region-based memory management for Go","authors":"Matthew Davis, P. Schachte, Z. Somogyi, H. Søndergaard","doi":"10.1145/2247684.2247695","DOIUrl":"https://doi.org/10.1145/2247684.2247695","url":null,"abstract":"Region-based memory management aims to lower the cost of deallocation through bulk processing: instead of recovering the memory of each object separately, it recovers the memory of a region containing many objects. It relies on static analysis to determine the set of memory regions needed by a program, the program points at which each region should be created and removed, and, for each memory allocation, the region that should supply the memory. The concurrent language Go has features that pose interesting challenges for this analysis. We present a novel design for region-based memory management for Go, combining static analysis, to guide region creation, and lightweight runtime bookkeeping, to help control reclamation. The main advantage of our approach is that it greatly limits the amount of re-work that must be done after each change to the program source code, making our approach more practical than existing RBMM systems. Our prototype implementation covers most of the sequential fragment of Go, and preliminary results are encouraging.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123899872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
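The bulk-deallocation idea behind RBMM can be illustrated with a bump-pointer region allocator. This is a hypothetical C++ sketch of the general technique, not the paper's Go implementation: allocation is a pointer increment into the current chunk, and the entire region is reclaimed at once when it is destroyed, rather than object by object.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Toy bump-pointer region (illustrative; does not handle allocations
// larger than one chunk).
class Region {
    static constexpr std::size_t kChunkSize = 64 * 1024;
    std::vector<char*> chunks_;
    std::size_t used_ = kChunkSize;   // forces a chunk allocation on first use
public:
    void* allocate(std::size_t n) {
        n = (n + 7) & ~std::size_t{7};            // keep 8-byte alignment
        if (used_ + n > kChunkSize) {             // current chunk exhausted
            chunks_.push_back(static_cast<char*>(std::malloc(kChunkSize)));
            used_ = 0;
        }
        void* p = chunks_.back() + used_;         // bump the pointer
        used_ += n;
        return p;
    }
    ~Region() {                                   // bulk reclamation: free
        for (char* c : chunks_) std::free(c);     // whole chunks, not objects
    }
};
```

The static analysis the abstract describes decides where such regions are created and destroyed and which allocation sites draw from which region; the runtime bookkeeping the authors add is what this sketch omits.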
{"title":"A higher order theory of locality","authors":"C. Ding, Xiaoya Xiang","doi":"10.1145/2247684.2247697","DOIUrl":"https://doi.org/10.1145/2247684.2247697","url":null,"abstract":"This short paper outlines a theory for deriving the traditional metrics of miss rate and reuse distance from a single measure called the footprint. It gives the correctness condition and discusses the uses of the new theory in on-line locality analysis and multicore cache management.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126690551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel memory defragmentation on a GPU","authors":"R. Veldema, M. Philippsen","doi":"10.1145/2247684.2247693","DOIUrl":"https://doi.org/10.1145/2247684.2247693","url":null,"abstract":"High-throughput memory management techniques such as malloc/free or mark-and-sweep collectors often exhibit memory fragmentation, leaving allocated objects interspersed with free memory holes. Memory defragmentation removes such holes by moving objects around in memory so that they become adjacent (compaction) and holes can be merged (coalesced) to form larger holes. However, known defragmentation techniques are slow. This paper presents a parallel solution for best-effort partial defragmentation that makes use of all available cores. The solution not only speeds up defragmentation significantly, but also scales to many simple cores. It can therefore even be implemented on a GPU.\n One problem with compaction is that it requires all references to moved objects to be retargeted to point to their new locations. This paper further improves on existing work by better identifying the parts of the heap that contain references to objects moved by the compactor, and only processes these parts to find the references, which are then retargeted in parallel.\n To demonstrate the performance of the new memory defragmentation algorithm on many-core processors, we show its performance on a modern GPU. Parallelization speeds up compaction 40 times and coalescing up to 32 times. After compaction, our algorithm only needs to process 2%--4% of the total heap to retarget references.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"447 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133281312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
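The two phases the abstract describes, sliding live objects together and then retargeting references, can be shown in a toy sequential C++ sketch (hypothetical and index-based; the paper's contribution is performing these phases in parallel on a GPU, which this does not attempt):

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Toy heap model: each object records its liveness and the indices of
// the objects it references.
struct Obj {
    std::size_t size;
    bool live;
    std::vector<std::size_t> refs;   // indices of referenced objects
};

// Sequential compaction sketch: slide live objects over the holes left
// by dead ones, then retarget every reference through a forwarding table.
void compact(std::vector<Obj>& heap) {
    std::unordered_map<std::size_t, std::size_t> forward;  // old -> new index
    std::size_t dst = 0;
    for (std::size_t i = 0; i < heap.size(); ++i)   // phase 1: compaction
        if (heap[i].live) {
            forward[i] = dst;
            heap[dst++] = heap[i];
        }
    heap.resize(dst);                               // holes are now coalesced
    for (Obj& o : heap)                             // phase 2: retargeting
        for (std::size_t& r : o.refs)
            r = forward[r];
}
```

The paper's 2%--4% figure corresponds to restricting phase 2 to only those heap parts known to contain references to moved objects, instead of scanning everything as this sketch does.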
{"title":"Analysis of pure methods using garbage collection","authors":"Erik Österlund, Welf Löwe","doi":"10.1145/2247684.2247694","DOIUrl":"https://doi.org/10.1145/2247684.2247694","url":null,"abstract":"Parallelization and other optimizations often depend on static dependence analysis. This approach requires methods to be independent regardless of the input data, which is not always the case.\n Our contribution is a dynamic analysis \"guessing\" whether methods are pure, i.e., whether they do not change state. The analysis piggybacks on a garbage collector, more specifically, a concurrent, replicating garbage collector. It guesses whether objects are immutable by looking at actual mutations observed by the garbage collector. The analysis comes essentially for free. In fact, our concurrent garbage collector including the analysis outperforms Boehm's stop-the-world collector (without any analysis), as we show in experiments. Moreover, false guesses can be rolled back efficiently.\n The results can be used for just-in-time parallelization, allowing automatic parallelization of methods that are pure over certain periods of time. Hence, compared to parallelization based on static dependence analysis, more programs potentially benefit from parallelization.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131011652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Can parallel data structures rely on automatic memory managers?","authors":"E. Petrank","doi":"10.1145/2247684.2247685","DOIUrl":"https://doi.org/10.1145/2247684.2247685","url":null,"abstract":"The complexity of parallel data structures is often measured by two major factors: the throughput they provide and the progress they guarantee. Progress guarantees are particularly important for systems that require responsiveness such as real-time systems, operating systems, interactive systems, etc. Notions of progress guarantees such as lock-freedom, wait-freedom, and obstruction-freedom that provide different levels of guarantees have been proposed in the literature [4, 6]. Concurrent access (and furthermore, optimistic access) to shared objects makes the management of memory one of the more complex aspects of concurrent algorithms design. The use of automatic memory management greatly simplifies such algorithms [11, 3, 2, 9]. However, while the existence of lock-free garbage collection has been demonstrated [5], the existence of a practical automatic memory manager that supports lock-free or wait-free algorithms is still open. Furthermore, known schemes for manual reclamation of unused objects are difficult to use and impose a significant overhead on the execution [10].\n It turns out that the memory management community is not fully aware of how dire the need is for memory managers that support progress guarantees in the design of concurrent data structures. Likewise, designers of concurrent data structures are not always aware that memory management with support for progress guarantees is not available. Closing this gap is a major open problem for both communities.\n In this talk we will examine the memory management needs of concurrent algorithms. Next, we will discuss how state-of-the-art research and practice deal with the fact that an important piece of technology is missing (e.g., [7, 1]). Finally, we will survey the currently available pieces of this puzzle (e.g., [13, 12, 8]) and specify which pieces are missing. This open problem is arguably the greatest challenge facing the memory management community today.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117218114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trace-driven simulation of memory system scheduling in multithread application","authors":"Peng Fei Zhu, Mingyu Chen, Yungang Bao, Licheng Chen, Yongbing Huang","doi":"10.1145/2247684.2247691","DOIUrl":"https://doi.org/10.1145/2247684.2247691","url":null,"abstract":"As commercial chip-multiprocessors (CMPs) integrate more and more cores, memory systems are playing an increasingly important role in multithread applications. Currently, trace-driven simulation is widely adopted in memory system scheduling research, since it is faster than execution-driven simulation and does not require data computation. For the same reason, however, its trace replay of concurrent thread execution lacks data information and contains only addresses, so misplacement occurs in simulations when the trace of one thread runs ahead of or behind the others. This kind of distortion can cause significant errors: as shown in our experiments, trace misplacement causes an error rate of up to 10.22% in metrics including weighted IPC speedup, harmonic mean of IPC, and CPI throughput. This paper presents a methodology to avoid trace misplacement in trace-driven simulation and to ensure the accuracy of memory scheduling simulation in multithread applications, thus providing a reliable means to study inter-thread interactions in memory systems.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"515 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116212375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}