{"title":"An empirical performance analysis of commodity memories in commodity servers","authors":"D. Kerbyson, M. Lang, G. Patino, Hossein Amidi","doi":"10.1145/1065895.1065903","DOIUrl":"https://doi.org/10.1145/1065895.1065903","url":null,"abstract":"This work details a performance study of six different types of commodity memories in two commodity server nodes. A number of micro-benchmarks are used that measure low-level performance characteristics, as well as two applications representative of the ASC workload. The memories vary both in terms of performance, including latency and bandwidth, and in terms of their physical properties and manufacturer. The two server nodes analyzed were an Itanium-II (Madison) based system and a Xeon-based system. All memories can be used within both of these processing nodes. This allows the performance of the memories to be directly examined while keeping all other factors within a node the same (processor, motherboard, operating system, etc.). The results of this study show that there can be a significant difference in application performance depending on the actual memory used, by as much as 20%. The achieved performance is a result of the integration of the memory into the node as well as how the applications actually utilize it.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116910987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reuse-distance-based miss-rate prediction on a per instruction basis","authors":"Changpeng Fang, S. Carr, Soner Önder, Zhenlin Wang","doi":"10.1145/1065895.1065906","DOIUrl":"https://doi.org/10.1145/1065895.1065906","url":null,"abstract":"Feedback-directed optimization has become an increasingly important tool in designing and building optimizing compilers. Recently, reuse-distance analysis has shown much promise in predicting the memory behavior of programs over a wide range of data sizes. Reuse-distance analysis predicts program locality by experimentally determining locality properties as a function of the data size of a program, allowing accurate locality analysis when the program's data size changes. Prior work has established the effectiveness of reuse-distance analysis in predicting whole-program locality and miss rates. In this paper, we show that reuse distance can also effectively predict locality and miss rates on a per-instruction basis. Rather than predict locality by analyzing reuse distances for memory addresses alone, we relate those addresses to particular static memory operations and predict the locality of each instruction. Our experiments show that using reuse distance without cache simulation to predict miss rates of instructions is superior to using cache simulations on a single representative data set to predict miss rates on various data sizes. In addition, our analysis allows us to identify the critical memory operations that are likely to produce a significant number of cache misses for a given data size. With this information, compilers can target cache optimization specifically to the instructions that can benefit from such optimizations most.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129280134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Metrics and models for reordering transformations","authors":"M. Strout, P. Hovland","doi":"10.1145/1065895.1065899","DOIUrl":"https://doi.org/10.1145/1065895.1065899","url":null,"abstract":"Irregular applications frequently exhibit poor performance on contemporary computer architectures, in large part because of their inefficient use of the memory hierarchy. Run-time data- and iteration-reordering transformations have been shown to improve the locality and therefore the performance of irregular benchmarks. This paper describes models for determining which combination of run-time data- and iteration-reordering heuristics will result in the best performance for a given dataset. We propose that the data- and iteration-reordering transformations be viewed as approximating minimal linear arrangements on two separate hypergraphs: a spatial locality hypergraph and a temporal locality hypergraph. Our results measure the efficacy of locality metrics based on these hypergraphs in guiding the selection of data- and iteration-reordering heuristics. We also introduce new iteration- and data-reordering heuristics based on the hypergraph models that result in better performance than do previous heuristics.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127430041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Polar opposites: next generation languages and architectures","authors":"K. McKinley","doi":"10.1145/1065895.1065900","DOIUrl":"https://doi.org/10.1145/1065895.1065900","url":null,"abstract":"Future hardware technology is on a collision course with modern programming languages. Adoption of programming languages is rare and slow, but programmers are now embracing high-level object-oriented languages such as Java and C# due to their software engineering benefits, which include (1) fast development through code reuse and garbage collection; (2) ease of maintenance through encapsulation and object-orientation; (3) reduced errors through type safety, pointer disciplines, and garbage collection; and (4) portability. These programs use small methods, dynamic class binding, heavy memory allocation, short-lived objects, and pointer data structures, and thus obscure parallelism, locality, and control flow, in direct conflict with hardware trends.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123361270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instruction combining for coalescing memory accesses using global code motion","authors":"M. Kawahito, H. Komatsu, T. Nakatani","doi":"10.1145/1065895.1065897","DOIUrl":"https://doi.org/10.1145/1065895.1065897","url":null,"abstract":"Instruction combining is an optimization to replace a sequence of instructions with a more efficient instruction yielding the same result in fewer machine cycles. When we use it for coalescing memory accesses, we can reduce the memory traffic by combining narrow memory references with contiguous addresses into a wider reference to take advantage of a wide-bus architecture. Coalescing memory accesses can improve performance for two reasons: one by reducing the additional cycles required for moving data from caches to registers and the other by reducing the stall cycles caused by multiple outstanding memory access requests. Previous approaches for memory access coalescing focus only on array access instructions related to loop induction variables, and thus they miss many other opportunities. In this paper, we propose a new algorithm for instruction combining by applying global code motion to wider regions of the given program in search of more potential candidates. We implemented two optimizations for coalescing memory accesses, one combining two 32-bit integer loads and the other combining two single-precision floating-point loads, using our algorithm in the IBM Java™ JIT compiler for IA-64, and evaluated them by measuring the SPECjvm98 benchmark suite. In our experiment, we can improve the maximum performance by 5.5% with little additional compilation time overhead. Moreover, when we replace every declaration of double for an instance variable with float, we can improve the performance by 7.3% for the MolDyn benchmark in the JavaGrande benchmark suite. Our approach can be applied to a variety of architectures and to programming languages besides Java.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129484455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Programmer specified pointer independence","authors":"D. Koes, M. Budiu, Girish Venkataramani","doi":"10.1145/1065895.1065905","DOIUrl":"https://doi.org/10.1145/1065895.1065905","url":null,"abstract":"Good alias analysis is essential in order to achieve high performance on modern processors, yet precise interprocedural analysis does not scale well. We present a source code annotation, #pragma independent, which provides precise pointer aliasing information to the compiler, and describe a tool which highlights the most important and most likely correct locations at which a programmer should insert these annotations. Using this tool we perform a limit study on the effectiveness of pointer independence in improving program performance through improved compilation.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121751144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From simulation to practice: cache performance study of a Prolog system","authors":"Ricardo Lopes, L. Castro, V. S. Costa","doi":"10.1145/773146.773045","DOIUrl":"https://doi.org/10.1145/773146.773045","url":null,"abstract":"Progress in Prolog applications requires ever better performance and scalability from Prolog implementation technology. Most modern Prolog systems are emulator-based. Best performance thus requires both good emulator design and good memory performance. Indeed, Prolog applications can often consume hundreds of megabytes of data, but there is little work on understanding and quantifying the interactions between Prolog programs and the memory architecture of modern computers. In a previous study of Prolog systems we have shown through simulation that Prolog applications usually, but not always, have good locality, both for deterministic and non-deterministic applications. We also showed that performance may strongly depend on garbage collection and on database operations. Our analysis left two questions unanswered: how well do our simulated results hold on actual hardware, and how much did our results depend on a specific configuration? In this work we use several simulation parameters and profiling counters to improve understanding of Prolog applications. We believe that our analysis is of interest to any system implementor who wants to understand his or her own system's memory performance.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130656837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Calculating stack distances efficiently","authors":"G. Almási, Calin Cascaval, D. Padua","doi":"10.1145/773146.773043","DOIUrl":"https://doi.org/10.1145/773146.773043","url":null,"abstract":"This paper describes our experience using the stack processing algorithm [6] for estimating the number of cache misses in scientific programs. By using a new data structure and various optimization techniques we obtain instrumented run-times within 50 to 100 times the original optimized run-times of our benchmarks.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121721659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic pool allocation for disjoint data structures","authors":"Chris Lattner, Vikram S. Adve","doi":"10.1145/773146.773041","DOIUrl":"https://doi.org/10.1145/773146.773041","url":null,"abstract":"This paper presents an analysis technique and a novel program transformation that can enable powerful optimizations for entire linked data structures. The fully automatic transformation converts ordinary programs to use pool (aka region) allocation for heap-based data structures. The transformation relies on an efficient link-time interprocedural analysis to identify disjoint data structures in the program, to check whether these data structures are accessed in a type-safe manner, and to construct a Disjoint Data Structure Graph that describes the connectivity pattern within such structures. We present preliminary experimental results showing that the data structure analysis and pool allocation are effective for a set of pointer intensive programs in the Olden benchmark suite. To illustrate the optimizations that can be enabled by these techniques, we describe a novel pointer compression transformation and briefly discuss several other optimization possibilities for linked data structures.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124412545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compiler-directed run-time monitoring of program data access","authors":"C. Ding, Y. Zhong","doi":"10.1145/773146.773040","DOIUrl":"https://doi.org/10.1145/773146.773040","url":null,"abstract":"Accurate run-time analysis has been expensive for complex programs, in part because most methods monitor all data. Some applications require only partial reorganization. An example of this is off-loading infrequently used data from a mobile device. Complete monitoring is not necessary because not all accesses can reach the displaced data. To support partial monitoring, this paper presents a framework that includes a source-to-source C compiler and a run-time monitor. The compiler inserts run-time calls, which invoke the monitor during execution. To be selective, the compiler needs to identify the relevant data and their accesses. It needs to analyze both the content and the location of monitored data. To reduce run-time overhead, the system uses a source-level interface, where the compiler transfers additional program information to reduce the workload of the monitor. The paper describes an implementation for general C programs. It evaluates different levels of data monitoring and their application on an SGI workstation and an Intel PC.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126314195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}