{"title":"On the importance of optimizing the configuration of stream prefetchers","authors":"I. Ganusov, Martin Burtscher","doi":"10.1145/1111583.1111591","DOIUrl":"https://doi.org/10.1145/1111583.1111591","url":null,"abstract":"This paper provides a detailed analysis of how the parameters of hardware prefetchers affect the memory system performance. In particular, we found the configuration of the frequently used stream prefetcher to have a major impact on the runtime, making parameter optimizations imperative when comparing a stream prefetcher with other prefetching techniques. For example, we show that adjusting the prefetch distance to the optimal value can increase the average speedup over the SPECcpu2000 benchmark suite from 40% to 70%. Moreover, our investigation of the performance of runahead prefetching relative to stream prefetching shows that choosing a non-optimal stream prefetcher as a baseline can distort the results by as much as a factor of two.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115057967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gated memory control for memory monitoring, leak detection and garbage collection","authors":"C. Ding, Chengliang Zhang, Xipeng Shen, M. Ogihara","doi":"10.1145/1111583.1111593","DOIUrl":"https://doi.org/10.1145/1111583.1111593","url":null,"abstract":"In the past, program monitoring has often operated at the code level, performing checks at function and loop boundaries. Recent research shows that profiling analysis can identify high-level phases in complex binary code. Examples are time steps in scientific simulations and service cycles in utility programs. Because of their larger size and more predictable behavior, program phases make possible more accurate and longer-term predictions of program behavior, especially its memory usage. This paper describes a new approach that uses phase boundaries as the gates to monitor and control the memory usage. In particular, it presents three techniques: memory usage monitoring, object lifetime classification, and preventive memory management. They use phase-level patterns to predict the trend of the program's memory demand, identify and control memory leaks, and improve the efficiency of garbage collection. The potential of the new techniques is demonstrated on two non-trivial applications: a C compiler and a Lisp interpreter.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122013394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recursive data structure profiling","authors":"Easwaran Raman, David I. August","doi":"10.1145/1111583.1111585","DOIUrl":"https://doi.org/10.1145/1111583.1111585","url":null,"abstract":"As the processor-memory performance gap increases, so does the need for aggressive data structure optimizations to reduce memory access latencies. Such optimizations require a better understanding of the memory behavior of programs. We propose a profiling technique called Recursive Data Structure Profiling to help better understand the memory access behavior of programs that use recursive data structures (RDS) such as lists, trees, etc. An RDS profile captures the runtime behavior of the individual instances of recursive data structures. RDS profiling differs from other memory profiling techniques in its ability to aggregate information pertaining to an entire data structure instance, rather than merely capturing the behavior of individual loads and stores, thereby giving a more global view of a program's memory accesses. This paper describes a method for collecting an RDS profile without requiring any high-level program representation or type information. RDS profiling achieves this with manageable space and time overhead on a mixture of pointer intensive benchmarks from the SPEC, Olden and other benchmark suites. To illustrate the potential of the RDS profile in providing a better understanding of memory accesses, we introduce a metric to quantify the notion of stability of an RDS instance. A stable RDS instance is one that undergoes very few changes to its structure between its initial creation and final destruction, making it an attractive candidate for certain data structure optimizations.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"175 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133391390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A locality-improving dynamic memory allocator","authors":"Yi Feng, E. Berger","doi":"10.1145/1111583.1111594","DOIUrl":"https://doi.org/10.1145/1111583.1111594","url":null,"abstract":"In general-purpose applications, most data is dynamically allocated. The memory manager therefore plays a crucial role in application performance by determining the spatial locality of heap objects. Previous general-purpose allocators have focused on reducing fragmentation, while most locality-improving allocators have either focused on improving the locality of the allocator (not the application), or required programmer hints or profiling to guide object placement. We present a high-performance memory allocator called Vam that transparently improves both cache-level and page-level locality of the application while achieving low fragmentation. Over a range of large-footprint benchmarks, Vam improves application performance by an average of 4%-8% versus the Lea (Linux) and FreeBSD allocators. When memory is scarce, Vam improves application performance by up to 2X compared to the FreeBSD allocator, and by over 10X compared to the Lea allocator.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115079195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance characteristics of MAUI: an intelligent memory system architecture","authors":"J. Teller, C. B. Silio, B. Jacob","doi":"10.1145/1111583.1111590","DOIUrl":"https://doi.org/10.1145/1111583.1111590","url":null,"abstract":"Combining ideas from several previous proposals, such as Active Pages, DIVA, and ULMT, we present the Memory Arithmetic Unit and Interface (MAUI) architecture. Because the \"intelligence\" of the MAUI intelligent memory system architecture is located in the memory-controller, logic and DRAM are not required to be integrated into a single chip, and use of off-the-shelf DRAMs is permitted. The MAUI's computational engine performs memory-bound SIMD computations close to the memory system, enabling more efficient memory pipelining. A simulator modeling the MAUI architecture was added to the SimpleScalar v4.0 tool-set. Not surprisingly, simulations show that application speedup increases as the memory system speed increases and the dataset size increases. Simulation results show single-threaded application speedup of over 100% is possible, and suggest that a total system speedup of about 300% is possible in a multi-threaded environment.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130793923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application analysis using memory pressure","authors":"K. Sudeep, A. Gheith","doi":"10.1145/1111583.1111586","DOIUrl":"https://doi.org/10.1145/1111583.1111586","url":null,"abstract":"As the speeds of microprocessors continue to follow Moore's law, memory speeds keep lagging farther behind so as to make the \"memory wall\" more and more distinct. In order for a processor architect to be able to evaluate the right micro-architectural features for the design, a study of the memory behavior of the applications becomes essential. In this paper we present a new metric termed \"memory pressure\" that can be used to analyze the application's behavior and quantify the demand an application places on the memory subsystem. Memory pressure is characterized by four metrics: (1) value-computation-to-use delay, (2) condition-resolution-to-use delay, (3) address-computation-to-use delay, and (4) value-load-to-use delay. It acts as an indicator of the opportunity that caching, prefetching, speculative loads or other DRAM latency hiding techniques can provide to improve the performance of the application. We have analyzed a few synthetic benchmarks as well as a few scientific applications and have been able to identify the benefit of caches and prefetch techniques for these benchmarks. As we demonstrate in this paper, quantifying the memory pressure not only provides insight into which architectural features a designer should evaluate for optimal performance, but also provides tangible hints to the software designer to make changes to the application -- algorithmic and structural -- to improve the performance.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126060952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impact of modern memory subsystems on cache optimizations for stencil computations","authors":"S. Kamil, P. Husbands, L. Oliker, J. Shalf, K. Yelick","doi":"10.1145/1111583.1111589","DOIUrl":"https://doi.org/10.1145/1111583.1111589","url":null,"abstract":"In this work we investigate the impact of evolving memory system features, such as large on-chip caches, automatic prefetch, and the growing distance to main memory on 3D stencil computations. These calculations form the basis for a wide range of scientific applications from simple Jacobi iterations to complex multigrid and block structured adaptive PDE solvers. First we develop a simple benchmark to evaluate the effectiveness of prefetching in cache-based memory systems. Next we present a small parameterized probe and validate its use as a proxy for general stencil computations on three modern microprocessors. We then derive an analytical memory cost model for quantifying cache-blocking behavior and demonstrate its effectiveness in predicting the stencil-computation performance. Overall results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129221672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transparent pointer compression for linked data structures","authors":"Chris Lattner, Vikram S. Adve","doi":"10.1145/1111583.1111587","DOIUrl":"https://doi.org/10.1145/1111583.1111587","url":null,"abstract":"64-bit address spaces are increasingly important for modern applications, but they come at a price: pointers use twice as much memory, reducing the effective cache capacity and memory bandwidth of the system (compared to 32-bit address spaces). This paper presents a sophisticated, automatic transformation that shrinks pointers from 64 bits to 32 bits. The approach is \"macroscopic,\" i.e., it operates on an entire logical data structure in the program at a time. It allows an individual data structure instance or even a subset thereof to grow up to 2^32 bytes in size, and can compress pointers to some data structures but not others. Together, these properties allow efficient usage of a large (64-bit) address space. We also describe (but have not implemented) a dynamic version of the technique that can transparently expand the pointers in an individual data structure if it exceeds the 4GB limit. For a collection of pointer-intensive benchmarks, we show that the transformation reduces peak heap sizes substantially (by 20% to 2x) for several of these benchmarks, and improves overall performance significantly in some cases.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133114330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving trace cache hit rates using the sliding window fill mechanism and fill select table","authors":"M. Shaaban, Edward Mulrane","doi":"10.1145/1065895.1065902","DOIUrl":"https://doi.org/10.1145/1065895.1065902","url":null,"abstract":"As superscalar processors become increasingly wide, it is inevitable that the large set of instructions to be fetched every cycle will span multiple noncontiguous basic blocks. The mechanism to fetch, align, and pass this set of instructions down the pipeline must do so as efficiently as possible. The concept of trace cache has emerged as the most promising technique to meet this high-bandwidth, low-latency fetch requirement. A new fill unit scheme, the Sliding Window Fill Mechanism, is proposed as a method to efficiently populate the trace cache. This method exploits trace continuity and identifies probable start regions to improve trace cache hit rate. Simulation yields a 7% average hit rate increase over the Rotenberg fill mechanism. When combined with branch promotion, trace cache hit rates experienced a 19% average increase along with a 17% average improvement in fetch bandwidth.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130701457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic blocking of QR and LU factorizations for locality","authors":"Qing Yi, K. Kennedy, Haihang You, Keith Seymour, J. Dongarra","doi":"10.1145/1065895.1065898","DOIUrl":"https://doi.org/10.1145/1065895.1065898","url":null,"abstract":"QR and LU factorizations for dense matrices are important linear algebra computations that are widely used in scientific applications. To efficiently perform these computations on modern computers, the factorization algorithms need to be blocked when operating on large matrices to effectively exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization algorithms contain complex loop structures, few compilers can fully automate the blocking of these algorithms. Though linear algebra libraries such as LAPACK provide manually blocked implementations of these algorithms, by automatically generating blocked versions of the computations, more benefit can be gained such as automatic adaptation of different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blockings for both QR and LU with partial pivoting. We present different blocking strategies that can be generated by our optimizer and compare the performance of auto-blocked versions with manually tuned versions in LAPACK, both using reference BLAS, ATLAS BLAS and native BLAS specially tuned for the underlying machine architectures.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131669264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}