{"title":"The cache behaviour of large lazy functional programs on stock hardware","authors":"N. Nethercote, A. Mycroft","doi":"10.1145/773146.773044","DOIUrl":"https://doi.org/10.1145/773146.773044","url":null,"abstract":"Lazy functional programs behave differently from imperative programs and these differences extend to cache behaviour. We use hardware counters and a simple yet accurate execution cost model to analyse some large Haskell programs on the x86 architecture. The programs do not interact well with modern processors: L2 cache data miss stalls and branch misprediction stalls account for up to 60% and 32% of execution time respectively. Moreover, the program code exhibits little exploitable instruction-level parallelism. We then use simulation to pinpoint cache misses at the instruction level. With this information we apply prefetching to minimise the cost of write misses, speeding up Haskell programs by up to 22%. We conclude with more ideas for changing the Glasgow Haskell Compiler and its garbage collector to improve the cache performance of large programs.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130133010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Older-first garbage collection in practice: evaluation in a Java Virtual Machine","authors":"D. Stefanovic, Matthew Hertz, S. Blackburn, K. McKinley, J. E. B. Moss","doi":"10.1145/773146.773042","DOIUrl":"https://doi.org/10.1145/773146.773042","url":null,"abstract":"Until recently, the best performing copying garbage collectors used a generational policy which repeatedly collects the very youngest objects, copies any survivors to an older space, and then infrequently collects the older space. A previous study that used garbage-collection simulation pointed to potential improvements by using an Older-First copying garbage collection algorithm. The Older-First algorithm sweeps a fixed-sized window through the heap from older to younger objects, and avoids copying the very youngest objects which have not yet had sufficient time to die. We describe and examine here an implementation of the Older-First algorithm in the Jikes RVM for Java. This investigation shows that Older-First can perform as well as the simulation results suggested, and greatly improves total program performance when compared to using a fixed-size nursery generational collector. We further compare Older-First to a flexible-size nursery generational collector in which the nursery occupies all of the heap that does not contain older objects. In these comparisons, the flexible-nursery collector is occasionally the better of the two, but on average the Older-First collector performs the best.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132932155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A proposal for a new hardware cache monitoring architecture","authors":"M. Schulz, J. Tao, Jürgen Jeitner, Wolfgang Karl","doi":"10.1145/773146.773047","DOIUrl":"https://doi.org/10.1145/773146.773047","url":null,"abstract":"The analysis of the memory access behavior of applications, an essential step for a successful cache optimization, is a complex task. It needs to be supported with appropriate tools and monitoring facilities. Currently, however, users can only rely on either simulation-based approaches, which deliver a large degree of detail but are restricted in their applicability, or on hardware counters embedded into processors, which can track only a few, mostly global events and hence provide only limited data. In this work a proposal for a novel hardware monitoring facility is presented which exhibits both the detail of traditional simulations and the low overhead of hardware counters. Like the latter approach, it is also targeted towards an implementation within the processor for direct and non-intrusive access to caches and memory busses. Unlike traditional counters, however, it delivers a detailed picture of the complete memory access behavior of applications. This is achieved by generating so-called memory access histograms, which show access frequencies in relation to the application's address space. Such spatial memory access information can then be used for efficient program optimization by focusing on the code and data segments which were found to exhibit poor cache behavior.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127792930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient static analysis algorithm to detect redundant memory operations","authors":"K. Cooper, Li Xu","doi":"10.1145/773146.773049","DOIUrl":"https://doi.org/10.1145/773146.773049","url":null,"abstract":"As memory system performance becomes an increasingly dominant factor in overall system performance, it is important to optimize programs for memory related operations. This paper concerns static analysis to detect redundant memory operations and enable other compiler transformations to remove such redundant operations. We present an extended global value numbering algorithm to detect redundant memory operations. The key to the new algorithm is a novel SSA-based representation for memory state which allows accurate reasoning about memory state. Using this representation, the algorithm can trace values through memory operations to detect equivalence in the same way that it traces them through register-based scalar operations. Thus it discovers both redundancy involving scalar values and redundancy involving memory operations. The redundancy relation detected by the algorithm can then be used by traditional redundancy elimination transformations to remove those redundant operations. Experiments using a suite of realistic applications demonstrate the algorithm is powerful and fast. In practice, it has essentially linear time complexity.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129428715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The performance advantage of applying compression to the memory system","authors":"N. Mahapatra, Jiangjiang Liu, Krishnan Sundaresan","doi":"10.1145/773146.773048","DOIUrl":"https://doi.org/10.1145/773146.773048","url":null,"abstract":"The memory system stores information comprising primarily instructions and data and secondarily address information, such as cache tag fields. It interacts with the processor by supporting related traffic (again comprising addresses, instructions, and data). Continuing exponential growth in processor performance, combined with technology, architecture, and application trends, places enormous demands on the memory system to permit this information storage and exchange at a high-enough performance (i.e., to provide low latency and high bandwidth access to large amounts of information). This paper comprehensively analyzes the redundancy in the information (addresses, instructions, and data) stored and exchanged between the processor and the memory system and evaluates the potential of compression in improving performance of the memory system. Analysis of traces obtained with Sun Microsystems' Shade simulator simulating SPARC executables of nine integer and six floating-point programs in the SPEC CPU2000 benchmark suite yields impressive results. Well-designed compression schemes may provide benefits in performance that far outweigh the extra time and logic for compression and decompression. This will be more so in the future since the speed and size of logic (which will be used to perform compression/decompression) are improving and are projected to improve at a much higher rate compared to those of interconnect (which will be used to communicate the information), both on-chip and off-chip.","PeriodicalId":365109,"journal":{"name":"Memory System Performance","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127227880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}