Workshop on Memory System Performance and Correctness: Latest Papers

Supporting virtual memory in GPGPU without supporting precise exceptions
Workshop on Memory System Performance and Correctness Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247698
Hyesoon Kim
Abstract: Supporting precise exceptions has been one of the essential components of designing modern out-of-order processors. It allows handling exception routines, including virtual memory support and also supports debugging features. However, GPGPU, one of the recent popular scientific computing platforms, does not support precise exceptions. Here, in this paper, we argue that supporting precise exceptions is not essential for GPGPUs and we propose an alternate solution to provide virtual memory support without supporting precise exceptions.
Citations: 11
Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis
Workshop on Memory System Performance and Correctness Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247687
Meng-Ju Wu, D. Yeung
Abstract: Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. In today's hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform extensive simulations to study these interactions, but this can be costly and not very insightful. An alternative is multicore reuse distance (RD) analysis, which can provide extremely rich information about multicore memory behavior. In this paper, we apply multicore RD analysis to better understand cache system design. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. We propose a novel framework to identify optimal multicore cache hierarchies, and extract several new insights. We also characterize how the optimal cache hierarchies vary with core count and problem size.
Citations: 18
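
To make the reuse distance (RD) notion in the abstract above concrete, here is a minimal single-threaded sketch. It is not the authors' multicore RD framework. It assumes the common definition of reuse distance (the number of distinct addresses touched between two consecutive accesses to the same address), and the names `reuse_distances` and `lru_misses` are hypothetical. The quadratic scan is chosen for readability, not speed.

```python
def reuse_distances(trace):
    """For each access, count the distinct addresses touched since the
    previous access to the same address (None for a first-time access)."""
    last_seen = {}   # address -> index of its most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = set(trace[last_seen[addr] + 1:i])
            distances.append(len(between))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


def lru_misses(distances, capacity):
    """Misses in a fully associative LRU cache holding `capacity` addresses:
    an access hits only if its reuse distance is smaller than the capacity."""
    return sum(1 for d in distances if d is None or d >= capacity)


trace = ["a", "b", "c", "a", "b", "b"]
dists = reuse_distances(trace)
print(dists)                  # [None, None, None, 2, 2, 0]
print(lru_misses(dists, 2))   # 5
print(lru_misses(dists, 3))   # 3
```

One pass over the trace yields miss counts for every cache capacity, which is why reuse distance profiles are attractive compared with running one simulation per cache size.
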
Defensive loop tiling for multi-core processor
Workshop on Memory System Performance and Correctness Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247701
Bin Bao, Xiaoya Xiang
Abstract: Loop tiling is a compiler transformation that tailors an application's working set to fit in a cache hierarchy. On today's multicore processors, part of the hierarchy, especially the last level cache (LLC) is shared. In this paper, we show that cache sharing requires special types of tiling depending on the co-run programs. We analyze the reasons for the performance difference and give a defensive strategy that performs consistently the best or near the best. For example, when compared with conservative tiling, which tiles for private cache, the performance of defensive tiling is similar in solo-runs but up to 20% higher in program co-runs, when tested on an Intel multicore processor.
Citations: 1
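
As a rough illustration of the loop tiling transformation this abstract refers to, the sketch below applies generic tiling to a matrix multiply. It does not implement the paper's defensive strategy; the `tile` parameter and the function names are hypothetical. In Python the example only shows the change in iteration order, since any cache benefit would appear in a compiled language.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A x B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation, visiting the iteration space in tile x tile x tile
    blocks so that each block's working set stays small."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, p, tile):
            for kk in range(0, m, tile):
                # min() handles the ragged edge when tile does not divide a dimension
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, p)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(a, b))   # [[58, 64], [139, 154]]
```

Choosing `tile` is the interesting part: per the abstract, a tile sized for a private cache and a tile sized for the shared last level cache behave differently once other programs run alongside.
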
A study towards optimal data layout for GPU computing
Workshop on Memory System Performance and Correctness Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247699
E. Zhang, Han Li, Xipeng Shen
Abstract: The performance of Graphic Processing Units (GPU) is sensitive to irregular memory references. A recent study shows the promise of eliminating irregular references through runtime thread-data remapping. However, how to efficiently determine the optimal mapping is yet an open question. This paper presents some initial exploration to the question, especially in the dimension of data layout optimization. It describes three algorithms to compute or approximate optimal data layouts for GPU. These algorithms exhibit a spectrum of tradeoff among the space cost, time cost, and quality of the resulting data layouts.
Citations: 7
Let there be light!: the future of memory systems is photonics and 3D stacking
Workshop on Memory System Performance and Correctness Pub Date : 2011-06-05 DOI: 10.1145/1988915.1988926
K. Bergman, G. Hendry, Paul H. Hargrove, J. Shalf, B. Jacob, K. Hemmert, Arun Rodrigues, D. Resnick
Abstract: Energy consumption is the fundamental barrier to exascale supercomputing and it is dominated by the cost of moving data from one point to another, not computation. Similarly, performance is dominated by data movement, not computation. The solution to this problem requires three critical technologies: 3D integration, optical chip-to-chip communication, and a new communication model. A memory system based on these technologies has the potential to lower the cost of local memory accesses by orders of magnitude and provide substantially more bandwidth. To reach the goals of exascale computing with a manageable power budget, the industry will have to adopt these technologies. Doing so will enable exascale computing, and will have a major worldwide economic impact.
Citations: 14
Deferred gratification: engineering for high performance garbage collection from the get go
Workshop on Memory System Performance and Correctness Pub Date : 2011-06-05 DOI: 10.1145/1988915.1988930
Ivan Jibaja, S. Blackburn, M. Haghighat, K. McKinley
Abstract: Implementing a new programming language system is a daunting task. A common trap is to punt on the design and engineering of exact garbage collection and instead opt for reference counting or conservative garbage collection (GC). For example, AppleScript™, Perl, Python, and PHP implementers chose reference counting (RC) and Ruby chose conservative GC. Although easier to get working, reference counting has terrible performance and conservative GC is inflexible and performs poorly when allocation rates are high. However, high performance GC is central to performance for managed languages and only becoming more critical due to relatively lower memory bandwidth and higher memory latency of modern architectures. Unfortunately, retrofitting support for high performance collectors later is a formidable software engineering task due to their exact nature. Whether they realize it or not, implementers have three routes: (1) forge ahead with reference counting or conservative GC, and worry about the consequences later; (2) build the language on top of an existing managed runtime with exact GC, and tune the GC to scripting language workloads; or (3) engineer exact GC from the ground up and enjoy the correctness and performance benefits sooner rather than later.
We explore this conundrum using PHP, the most popular server side scripting language. PHP implements reference counting, mirroring scripting languages before it. Because reference counting is incomplete, the implementors must (a) also implement tracing to detect cyclic garbage, or (b) prohibit cyclic data structures, or (c) never reclaim cyclic garbage. PHP chose (a), AppleScript chose (b), and Perl chose (c).
We characterize the memory behavior of five typical PHP programs to determine whether their implementation choice was a good one in light of the growing demand for high performance PHP. The memory behavior of these PHP programs is similar to other managed languages, such as Java™: they allocate many short lived objects, a large variety of object sizes, and the average allocated object size is small. These characteristics suggest copying generational GC will attain high performance.
Language implementers who are serious about correctness and performance need to understand deferred gratification: paying the software engineering cost of exact GC up front will deliver correctness and memory system performance later.
Citations: 11
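
The abstract's point that reference counting alone cannot reclaim cyclic garbage can be observed directly in CPython, which pairs reference counting with a tracing cycle collector exposed through the standard `gc` module. The sketch below is an illustration under that assumption (it is CPython-specific and is not taken from the paper); `Node` and `cycle_survives_refcounting` are hypothetical names.

```python
import gc
import weakref


class Node:
    """Minimal object that can point at another Node."""
    def __init__(self):
        self.other = None


def cycle_survives_refcounting():
    """Build a two-object cycle, drop all external references, and report
    whether the objects are still alive before and after a tracing pass."""
    was_enabled = gc.isenabled()
    gc.disable()                      # leave only reference counting active
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a       # a <-> b form a cycle
        probe = weakref.ref(a)        # observe a without keeping it alive
        del a, b                      # refcounts stay at 1 because of the cycle
        alive_after_del = probe() is not None
        gc.collect()                  # tracing pass finds the unreachable cycle
        alive_after_trace = probe() is not None
        return alive_after_del, alive_after_trace
    finally:
        if was_enabled:
            gc.enable()


print(cycle_survives_refcounting())   # (True, False) on CPython
```

This corresponds to option (a) in the abstract: reference counting handles the common case, and a separate tracing step is needed to detect cycles.
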
How to fit program footprint curves
Workshop on Memory System Performance and Correctness Pub Date : 2011-06-05 DOI: 10.1145/1988915.1988920
Xiaoya Xiang, Bin Bao
Abstract: A footprint is the volume of data accessed in a time window. A complete characterization requires summarizing all footprints in all execution windows. A concise summary is the footprint curve, which gives the average footprint in windows of different lengths. The footprint curve contains information from all footprints. It can be measured in time O(n) for a trace of length n, which is fast enough for most benchmarks.
In this paper, we outline a study on footprint curves. We propose four curve fitting methods based on the real data observed in SPEC benchmark programs.
Citations: 0
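
The footprint definitions in the abstract above can be spelled out with a brute-force sketch. This is not the O(n) measurement algorithm the abstract mentions: it enumerates every window directly, and the names `average_footprint` and `footprint_curve` are hypothetical. It assumes time is counted in accesses and that a footprint is the number of distinct addresses in a window.

```python
def average_footprint(trace, window):
    """Average number of distinct addresses over all windows of the given length."""
    if not 1 <= window <= len(trace):
        raise ValueError("window must be between 1 and len(trace)")
    counts = [
        len(set(trace[start:start + window]))
        for start in range(len(trace) - window + 1)
    ]
    return sum(counts) / len(counts)


def footprint_curve(trace):
    """Average footprint for every window length from 1 to len(trace)."""
    return [average_footprint(trace, w) for w in range(1, len(trace) + 1)]


trace = ["a", "b", "a", "c"]
print(footprint_curve(trace))   # [1.0, 2.0, 2.5, 3.0]
```

The resulting list is the kind of curve that a fitting method would then approximate with a closed-form function.
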
A programming model for deterministic task parallelism
Workshop on Memory System Performance and Correctness Pub Date : 2011-06-05 DOI: 10.1145/1988915.1988918
Polyvios Pratikakis, H. Vandierendonck, Spyros Lyberis, Dimitrios S. Nikolopoulos
Abstract: The currently dominant programming models to write software for multicore processors use threads that run over shared memory. However, as the core count increases, cache coherency protocols get very complex and ineffective, and maintaining a shared memory abstraction becomes expensive and impractical. Moreover, writing multithreaded programs is notoriously difficult, as the programmer needs to reason about all the possible thread interleavings and interactions, including the myriad of implicit, non-obvious, and often unpredictable thread interactions through shared memory. Overall, as processors get more cores and parallel software becomes mainstream, the shared memory model reaches its limits regarding ease of programming and efficiency.
This position paper presents two ideas aiming to solve the problem. First, we restrict the way the programmer expresses parallelism: The program is a collection of possibly recursive tasks, where each task is atomic and cannot communicate with any other task during its execution. Second, we relax the requirement for coherent shared memory: Each task defines its memory footprint, and is guaranteed to have exclusive access to that memory during its execution.
Using this model, we can then define a runtime system that transparently performs the data transfers required among cores without cache coherency, and also produces a deterministic execution of the program, provably equivalent to its sequential elision.
Citations: 21
The impact of diverse memory architectures on multicore consumer software: an industrial perspective from the video games domain
Workshop on Memory System Performance and Correctness Pub Date : 2011-06-05 DOI: 10.1145/1988915.1988925
G. Russell, C. Riley, Neil Henning, Uwe Dolinsky, A. Richards, A. Donaldson, A. V. Amesfoort
Abstract: Memory architectures need to adapt in order for performance and scalability to be achieved in software for multicore systems. In this paper, we discuss the impact of techniques for scalable memory architectures, especially the use of multiple, non-cache-coherent memory spaces, on the implementation and performance of consumer software. Primarily, we report extensive real-world experience in this area gained by Codeplay Software Ltd., a software tools company working in the area of compilers for video games and GPU software. We discuss the solutions we use to handle variations in memory architecture in consumer software, and the impact such variations have on software development effort and, consequently, development cost. This paper introduces preliminary findings regarding impact on software, in advance of a larger-scale analysis planned over the next few years. The techniques discussed have been employed successfully in the development and optimisation of a shipping AAA cross-platform video game.
Citations: 0
Approximating inclusion-based points-to analysis
Workshop on Memory System Performance and Correctness Pub Date : 2011-06-05 DOI: 10.1145/1988915.1988931
R. Nasre
Abstract: It has been established that achieving a points-to analysis that is scalable in terms of analysis time typically involves trading off analysis precision and/or memory. In this paper, we propose a novel technique to approximate the solution of an inclusion-based points-to analysis. The technique is based on intelligently approximating pointer- and location-equivalence across variables in the program. We develop a simple approximation algorithm based on the technique. By exploiting various behavioral properties of the solution, we develop another improved algorithm which implements various optimizations related to the merging order, proximity search, lazy merging and identification frequency. The improved algorithm provides a strong control to the client to trade off analysis time and precision as per its requirements. Using a large suite of programs including SPEC 2000 benchmarks and five large open source programs, we show how our algorithm helps achieve a scalable solution.
Citations: 5