Gadi Haber, M. Klausner, Vadim Eisenberg, Bilha Mendelson, M. Gurevich
{"title":"Optimization opportunities created by global data reordering","authors":"Gadi Haber, M. Klausner, Vadim Eisenberg, Bilha Mendelson, M. Gurevich","doi":"10.1109/CGO.2003.1191548","DOIUrl":"https://doi.org/10.1109/CGO.2003.1191548","url":null,"abstract":"Memory access has proven to be one of the bottlenecks in modern architectures. Improving memory locality and eliminating the amount of memory access can help release this bottleneck. We present a method for link-time profile-based optimization by reordering the global data of the program and modifying its code accordingly. The proposed optimization reorders the entire global data of the program, according to a representative execution rate of each instruction (or basic block) in the code. The data reordering is done in a way that enables the replacement of frequently-executed Load instructions, which reference the global data, with fast Add Immediate instructions. In addition, it tries to improve the global data locality and to reduce the total size of the global data area. The optimization was implemented into FDPR (Feedback Directed Program Restructuring), a post-link optimizer, which is part of the IBM AIX operating system for the IBM pSeries servers. Our results on SPECint2000 show a significant improvement of up to 11% (average 3%) in execution time, along with up to 97.9% (average 83%) reduction in memory references to the global variables via the global data access mechanism of the program.","PeriodicalId":277590,"journal":{"name":"International Symposium on Code Generation and Optimization, 2003. CGO 2003.","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127230209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inlining of mathematical functions in HP-UX for Itanium/sup /spl reg// 2","authors":"James W. Thomas","doi":"10.1109/CGO.2003.1191540","DOIUrl":"https://doi.org/10.1109/CGO.2003.1191540","url":null,"abstract":"HP-UX compilers inline mathematical functions for Itanium processor family (IPF) systems to improve throughput 4X-8X versus external library calls, achieving speeds comparable to highly tuned vector functions, without requiring the user to code for a vector interface and without sacrificing accuracy or edge-case behaviors. This paper highlights IPF architectural features that support implementation of high-performance, high-quality mathematics functions for inlining. It discusses strategies for utilizing the features and developing inlineable sequences on a large scale, and it presents requisite compiler features and language extensions. Also, this paper describes compiler mechanisms that produce inlineable code and inline it.","PeriodicalId":277590,"journal":{"name":"International Symposium on Code Generation and Optimization, 2003. CGO 2003.","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134061942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phi-predication for light-weight if-conversion","authors":"Weihaw Chuang, B. Calder, J. Ferrante","doi":"10.1109/CGO.2003.1191544","DOIUrl":"https://doi.org/10.1109/CGO.2003.1191544","url":null,"abstract":"Predicated execution can eliminate hard to predict branches and help to enable instruction level parallelism. Many current predication variants exist where the result update is conditional based upon the outcome of the guarding predicate. However conditional writing of a register creates a naming problem for an out-of-order processor and can stall the issuing of instructions. This problem arises from potential multiple predicated definitions reaching a use, which is unresolved until the prior predicate values are computed. We focus on a light-weight form of predication, phi-predication, where all predicated instructions write a result value to their register regardless of the predicate value (i.e. even if it is false). Therefore, the predicate does not guard the writing of the result register; it instead acts as a form of selection between two input registers. This eliminates the naming problem for an out-of-order processor. Our phi-predicated ISA is derived from the predicated features of the Multiflow ISA, with extensions to efficiently predicate complex control flow. Our compiler modifications also expand upon prior techniques to provide efficient code generation. We examine the use of phi-predication for an in-order and out-of-order architecture and compare its performance to using select-op and IA64 ISA predication.","PeriodicalId":277590,"journal":{"name":"International Symposium on Code Generation and Optimization, 2003. CGO 2003.","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116504623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic binary translation for accumulator-oriented architectures","authors":"Ho-Seop Kim, James E. Smith","doi":"10.1109/CGO.2003.1191530","DOIUrl":"https://doi.org/10.1109/CGO.2003.1191530","url":null,"abstract":"A dynamic binary translation system for a co-designed virtual machine is described and evaluated. The underlying hardware directly executes an accumulator-oriented instruction set that exposes instruction dependence chains (strands) to a distributed microarchitecture containing a simple instruction pipeline. To support conventional program binaries, a source instruction set (Alpha in our study) is dynamically translated to the target accumulator instruction set. The binary translator identifies chains of inter-instruction dependences and assigns them to dependence-carrying accumulators. Because the underlying superscalar microarchitecture is capable of dynamic instruction scheduling, the binary translation system does not perform aggressive optimizations or re-schedule code; this significantly reduces binary translation overhead. Detailed timing simulation of the dynamically translated code running on an accumulator-based distributed microarchitecture shows the overall system is capable of achieving similar performance to an ideal out-of-order superscalar processor, ignoring the significant clock frequency advantages that the accumulator-based hardware is likely to have. As part of the study, we evaluate an instruction set modification that simplifies precise trap implementation. This approach significantly reduces the number of instructions required for register state copying, thereby improving performance. We also observe that translation chaining methods can have substantial impact on the performance, and we evaluate a number of chaining methods.","PeriodicalId":277590,"journal":{"name":"International Symposium on Code Generation and Optimization, 2003. CGO 2003.","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127979106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Addressing mode selection","authors":"E. Eckstein, Bernhard Scholz","doi":"10.1109/CGO.2003.1191557","DOIUrl":"https://doi.org/10.1109/CGO.2003.1191557","url":null,"abstract":"Many processor architectures provide a set of addressing modes in their address generation units. For example DSP (digital signal processors) have powerful addressing modes for efficiently implementing numerical algorithms. Typical addressing modes of DSP are auto post-modification and indexing for address registers. The selection of the optimal addressing modes in the means of minimal code size and minimal execution time depends on many parameters and is NP complete in general. In this work we present a new approach for solving the addressing mode selection (AMS) problem. We provide a method for modeling the target architecture's addressing modes as cost functions for a partitioned Boolean quadratic optimization problem (PBQP). For solving the PBQP we present an efficient and effective way to implement large matrices for modeling the cost model. We have integrated the addressing mode selection with the Atair C-Compiler for the uPD7705x DSP from NEC. In our experiments we show that the addressing mode selection can be optimally solved for almost all benchmark programs and the compile-time overhead of the address mode selection is within acceptable bounds for a production DSP compiler.","PeriodicalId":277590,"journal":{"name":"International Symposium on Code Generation and Optimization, 2003. CGO 2003.","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128109582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicate-aware scheduling: a technique for reducing resource constraints","authors":"M. Smelyanskiy, S. Mahlke, E. Davidson, H. Lee","doi":"10.1109/CGO.2003.1191543","DOIUrl":"https://doi.org/10.1109/CGO.2003.1191543","url":null,"abstract":"Predicated execution enables the removal of branches wherein segments of branching code are converted into straight-line segments of conditional operations. An important, but generally ignored side effect of this transformation is that the compiler must assign distinct resources to all the predicated operations at a given time to ensure that those resources are available at run-time. However, a resource is only put to productive use when the predicates associated with its operations evaluate to True. We propose predicate-aware scheduling to reduce the superfluous commitment of resources to operations whose predicates evaluate to False at run-time. The central idea is to assign multiple operations to the same resource at the same time, thereby oversubscribing its use. This assignment is intelligently performed to ensure that no two operations simultaneously assigned to the same resource will have both of their predicates evaluate to True. Thus, no resource is dynamically oversubscribed. The overall effect of predicate aware scheduling is to use resources more efficiently, thereby increasing performance when resource constraints are a bottleneck.","PeriodicalId":277590,"journal":{"name":"International Symposium on Code Generation and Optimization, 2003. CGO 2003.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132501088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic trace selection using performance monitoring hardware sampling","authors":"Howard Chen, W. Hsu, Dong-yuan Chen","doi":"10.1109/CGO.2003.1191535","DOIUrl":"https://doi.org/10.1109/CGO.2003.1191535","url":null,"abstract":"Optimizing programs at run-time provides opportunities to apply aggressive optimizations to programs based on information that was not available at compile time. At run time, programs can be adapted to better exploit architectural features, optimize the use of dynamic libraries, and simplify code based on run-time constants. Our profiling system provides a framework for collecting information required for performing run-time optimization. We sample the performance hardware registers available on an Itanium processor, and select a set of code that is likely to lead to important performance-events. We gather distribution information about the performance-events we wish to monitor, and test our traces by estimating the ability for dynamic patching of a program to execute run-time generated traces. Our results show that we are able to capture 58% of execution time across various SPEC2000 integer benchmarks using our profile and patching techniques on a relatively small number of frequently executed execution paths. Our profiling and detection system overhead increases execution time by only 2-4%.","PeriodicalId":277590,"journal":{"name":"International Symposium on Code Generation and Optimization, 2003. CGO 2003.","volume":"525 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124321869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizations to prevent cache penalties for the Intel/spl reg/ Itanium/spl reg/ 2 processor","authors":"J. Collard, Daniel M. Lavery","doi":"10.1109/CGO.2003.1191537","DOIUrl":"https://doi.org/10.1109/CGO.2003.1191537","url":null,"abstract":"This paper describes scheduling optimizations in the Intel/spl reg/ Itanium/spl reg/ compiler to prevent cache penalties due to various micro-architectural effects on the Itanium 2 processor. This paper does not try to improve cache hit rates but to avoid penalties, which probably all processors have in one form or another, even in the case of cache hits. These optimizations make use of sophisticated methods for disambiguation of memory references, and this paper examines the performance improvement obtained by integrating these methods into the cache optimizations.","PeriodicalId":277590,"journal":{"name":"International Symposium on Code Generation and Optimization, 2003. CGO 2003.","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126495791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design, implementation and evaluation of adaptive recompilation with on-stack replacement","authors":"Stephen J. Fink, Feng Qian","doi":"10.1109/CGO.2003.1191549","DOIUrl":"https://doi.org/10.1109/CGO.2003.1191549","url":null,"abstract":"Modern virtual machines often maintain multiple compiled versions of a method. An on-stack replacement (OSR) mechanism enables a virtual machine to transfer execution between compiled versions, even while a method runs. Relying on this mechanism, the system can exploit powerful techniques to reduce compile time and code space, dynamically de-optimize code, and invalidate speculative optimizations. The paper presents a new, simple, mostly compiler-independent mechanism to transfer execution into compiled code. Additionally, we present enhancements to an analytic model for recompilation to exploit OSR for more aggressive optimization. We have implemented these techniques in Jikes RVM and present a comprehensive evaluation, including a study of fully automatic, online, profile-driven deferred compilation.","PeriodicalId":277590,"journal":{"name":"International Symposium on Code Generation and Optimization, 2003. CGO 2003.","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127914470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}