Proceedings of the ... CGO : International Symposium on Code Generation and Optimization. International Symposium on Code Generation and Optimization最新文献
{"title":"Reactive techniques for controlling software speculation","authors":"C. Zilles, Naveen Neelakantam","doi":"10.1109/CGO.2005.30","DOIUrl":"https://doi.org/10.1109/CGO.2005.30","url":null,"abstract":"Aggressive software speculation holds significant potential, because it enables program transformations to reduce the program's critical path. Like any form of speculation, however, the key to software speculation is employing it only where it is likely to succeed. While mechanisms for controlling hardware speculation (e.g., saturating counters updated after each instance) are well understood, these techniques do not translate directly to software techniques because changing a speculation requires changing the code. As it stands, the dominant software speculation control technique, non-reactive profile-guided optimization, lacks the robustness to support aggressive speculation. The primary thesis of this paper is that software speculation can be made to be robust by adding a reactive controller that can dynamically adjust the speculation. We make two primary observations about such systems: 1) reactive control systems can select behaviors on which to speculate with performance that equals or exceeds self-training, and 2) such control systems are remarkably latency tolerant. Although reactivity is required, it can be done at a low frequency; latencies of hundreds of thousands, or even millions of cycles, can be tolerated for most actions. Together these two characteristics imply that robust aggressive software speculation is a realistic goal.","PeriodicalId":92120,"journal":{"name":"Proceedings of the ... CGO : International Symposium on Code Generation and Optimization. International Symposium on Code Generation and Optimization","volume":"33 1","pages":"305-316"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81105237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Das, Jiwei Lu, Howard Chen, Jinpyo Kim, P. Yew, W. Hsu, Dong-yuan Chen
{"title":"Performance of runtime optimization on BLAST","authors":"A. Das, Jiwei Lu, Howard Chen, Jinpyo Kim, P. Yew, W. Hsu, Dong-yuan Chen","doi":"10.1109/CGO.2005.25","DOIUrl":"https://doi.org/10.1109/CGO.2005.25","url":null,"abstract":"Optimization of a real world application BLAST is used to demonstrate the limitations of static and profile-guided optimizations and to highlight the potential of runtime optimization systems. We analyze the performance profile of this application to determine performance bottlenecks and evaluate the effect of aggressive compiler optimizations on BLAST. We find that applying common optimizations (e.g. O3) can degrade performance. Profile guided optimizations do not show much improvement across the board, as current implementations do not address critical performance bottlenecks in BLAST. In some cases, these optimizations lower performance significantly due to unexpected secondary effects of aggressive optimizations. We also apply runtime optimization to BLAST using the ADORE framework. ADORE is able to detect performance bottlenecks and deploy optimizations resulting in performance gains up to 58% on some queries using data cache prefetching.","PeriodicalId":92120,"journal":{"name":"Proceedings of the ... CGO : International Symposium on Code Generation and Optimization. International Symposium on Code Generation and Optimization","volume":"11 1","pages":"86-96"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91159508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effective adaptive computing environment management via dynamic optimization","authors":"Shiwen Hu, M. Valluri, L. John","doi":"10.1109/CGO.2005.17","DOIUrl":"https://doi.org/10.1109/CGO.2005.17","url":null,"abstract":"To minimize the surging power consumption of microprocessors, adaptive computing environments (ACEs) where microarchitectural resources can be dynamically tuned to match a program's runtime requirement and characteristics are becoming increasingly common. Adaptive computing environments usually have multiple configurable hardware units, necessitating exploration of a large number of combinatorial configurations in order to identify the most energy-efficient configuration. In this paper, we propose a scheme for efficient management of multiple configurable units, utilizing the inherent capabilities of dynamic optimization systems. Most dynamic optimizers typically detect dominant code regions (hotspots). We develop an ACE management scheme where hotpot boundaries are used for phase detection and adaptation. Since hotspots are of variable sizes and are often nested, program phase behavior which is hierarchical in nature is automatically captured in this technique. To demonstrate the usefulness and effectiveness of our framework, we use the proposed framework to dynamically adapt the sizes of L1 data and L2 caches that have different reconfiguration latencies and overheads. Our technique reduces L1D and L2 cache energy consumption by 47% and 58%, while a popular previously proposed technique only achieves reduction of 32% and 52% respectively.","PeriodicalId":92120,"journal":{"name":"Proceedings of the ... CGO : International Symposium on Code Generation and Optimization. International Symposium on Code Generation and Optimization","volume":"288 6 1","pages":"63-73"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85461626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Maintaining consistency and bounding capacity of software code caches","authors":"Derek Bruening, Saman P. Amarasinghe","doi":"10.1109/CGO.2005.19","DOIUrl":"https://doi.org/10.1109/CGO.2005.19","url":null,"abstract":"Software code caches are becoming ubiquitous, in dynamic optimizers, runtime tool platforms, dynamic translators fast simulators and emulators, and dynamic compilers. Caching frequently executed fragments of code provides significant performance boosts, reducing the overhead of translation and emulation and meeting or exceeding native performance in dynamic optimizers. One disadvantage of caching, memory expansion, can sometimes be ignored when executing a single application. However, as optimizers and translators are applied more and more in production systems, the memory expansion from running multiple applications simultaneously becomes problematic. A second drawback to caching is the added requirement of maintaining consistency between the code cache and the original code. On architectures like IA-32 that do not require explicit application actions when modifying code, detecting code changes is challenging. Again, consistency can be ignored for certain sets of applications, but as caching systems scale up to executing large, modern, complex programs, consistency becomes critical. This paper presents efficient schemes for keeping a software code cache consistent and for dynamically bounding code cache size to match the current working set of the application. These schemes are evaluated in the DynamoRIO runtime code manipulation system, and operate on stock hardware in the presence of multiple threads and dynamic behavior, including dynamically-loaded, generated, and even modified code.","PeriodicalId":92120,"journal":{"name":"Proceedings of the ... CGO : International Symposium on Code Generation and Optimization. International Symposium on Code Generation and Optimization","volume":"447 1","pages":"74-85"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82905330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic generation of high-performance trace compressors","authors":"Martin Burtscher, Nana B. Sam","doi":"10.1109/CGO.2005.6","DOIUrl":"https://doi.org/10.1109/CGO.2005.6","url":null,"abstract":"Program execution traces are frequently used in industry and academia. Yet, most trace-compression algorithms have to be re-implemented every time the trace format is changed, which takes time, is error prone, and often results in inefficient solutions. This paper describes and evaluates TCgen, a too that automatically generates portable, customized, high-performance trace compressors. All the user has to do is provide a description of the trace format and select one or more predictors to compress the fields in the trace records. TCgen translates this specification into C source code and optimizes it for the specified trace format and predictor algorithms. On average, the generated code is faster and compresses better than the six other compression algorithms we have tested. For example, a comparison with SBC, one of the best trace-compression algorithms in the current literature, shows that TCgen's synthesized code compresses SPECcpu2000 address traces 23% more, decompresses them 24% faster, and compresses them 1029% faster.","PeriodicalId":92120,"journal":{"name":"Proceedings of the ... CGO : International Symposium on Code Generation and Optimization. International Symposium on Code Generation and Optimization","volume":"28 1","pages":"229-240"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74784930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient SIMD code generation for runtime alignment and length conversion","authors":"Peng Wu, A. Eichenberger, Amy Wang","doi":"10.1109/CGO.2005.18","DOIUrl":"https://doi.org/10.1109/CGO.2005.18","url":null,"abstract":"When generating codes for today's multimedia extensions, one of the major challenges is to deal with memory alignment issues. While hand programming still yields best performing SIMD codes, it is both time consuming and error prone. Compiler technology has greatly improved, including techniques that simdize loops with misaligned accesses by automatically rearranging misaligned memory streams in registers. Current techniques are applicable to runtime alignments, but they aggressively reduce the alignment overhead only when all alignments are known at compile time. This paper presents two major enhancements to the state of the art, improving both performance and coverage. First, we propose a novel technique to simdize loops with runtime alignment nearly as efficiently as those with compile-time misalignment. Runtime alignment is pervasive in real applications because it is either part of the algorithms, or it is an artifact of the compiler's inability to extract accurate alignment information from complex applications. Second, we incorporate length conversion operations, e.g., conversions between data of different sizes, into the alignment handling framework. Length conversions are pervasive in multimedia applications where mixed integer types are often used. Supporting length conversion can greatly improve the coverage of simdizable loops. Experimental results indicate that our runtime alignment technique achieves a 19% to 32% speedup increase over prior art for a benchmark stressing the impact of misaligned data. We also demonstrate speedup factors of up to 8.11 for real benchmarks over sequential execution.","PeriodicalId":92120,"journal":{"name":"Proceedings of the ... CGO : International Symposium on Code Generation and Optimization. International Symposium on Code Generation and Optimization","volume":"54 1","pages":"153-164"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78657292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy","authors":"Chun Chen, Jacqueline Chame, Mary W. Hall","doi":"10.1109/CGO.2005.10","DOIUrl":"https://doi.org/10.1109/CGO.2005.10","url":null,"abstract":"This paper describes an algorithm for simultaneously optimizing across multiple levels of the memory hierarchy for dense-matrix computations. Our approach combines compiler models and heuristics with guided empirical search to take advantage of their complementary strengths. The models and heuristics limit the search to a small number of candidate implementations, and the empirical results provide the most accurate information to the compiler to select among candidates and tune optimization parameter values. We have developed an initial implementation and applied this approach to two case studies, matrix multiply and Jacobi relaxation. For matrix multiply, our results on two architectures, SGI R10000 and Sun UltraSparc IIe, outperform the native compiler, and either outperform or achieve comparable performance as the ATLAS self-tuning library and the hand-tuned vendor BLAS library. Jacobi results also substantially outperform the native compilers.","PeriodicalId":92120,"journal":{"name":"Proceedings of the ... CGO : International Symposium on Code Generation and Optimization. International Symposium on Code Generation and Optimization","volume":"206 1","pages":"111-122"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77042802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting unroll factors using supervised classification","authors":"M. Stephenson, Saman P. Amarasinghe","doi":"10.1109/CGO.2005.29","DOIUrl":"https://doi.org/10.1109/CGO.2005.29","url":null,"abstract":"Compilers base many critical decisions on abstracted architectural models. While recent research has shown that modeling is effective for some compiler problems, building accurate models requires a great deal of human time and effort. This paper describes how machine learning techniques can be leveraged to help compiler writers model complex systems. Because learning techniques can effectively make sense of high dimensional spaces, they can be a valuable tool for clarifying and discerning complex decision boundaries. In this work we focus on loop unrolling, a well-known optimization for exposing instruction level parallelism. Using the Open Research Compiler as a testbed, we demonstrate how one can use supervised learning techniques to determine the appropriateness of loop unrolling. We use more than 2,500 loops - drawn from 72 benchmarks - to train two different learning algorithms to predict unroll factors (i.e., the amount by which to unroll a loop) for any novel loop. The technique correctly predicts the unroll factor for 65% of the loops in our dataset, which leads to a 5% overall improvement for the SPEC 2000 benchmark suite (9% for the SPEC 2000 floating point benchmarks).","PeriodicalId":92120,"journal":{"name":"Proceedings of the ... CGO : International Symposium on Code Generation and Optimization. International Symposium on Code Generation and Optimization","volume":"64 1","pages":"123-134"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82362258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Superword-level parallelism in the presence of control flow","authors":"Jaewook Shin, Mary W. Hall, Jacqueline Chame","doi":"10.1109/CGO.2005.33","DOIUrl":"https://doi.org/10.1109/CGO.2005.33","url":null,"abstract":"In this paper, we describe how to extend the concept of superword-level parallelization (SLP), used for multimedia extension architectures, so that it can be applied in the presence of control flow constructs. Superword-level parallelization involves identifying scalar instructions in a large basic block that perform the same operation, and, if dependences do not prevent it, combining them into a superword operation on a multi-word object. A key insight is that we can use techniques related to optimizations for architectures supporting predicated execution, even for multimedia ISAs that do not provide hardware predication. We derive large basic blocks with predicated instructions to which SLP can be applied. We describe how to minimize overheads for superword predicates and re-introduce control flow for scalar operations. We discuss other extensions to SLP to address common features of real multimedia codes. We present automatically-generated performance results on 8 multimedia codes to demonstrate the power of this approach. We observe speedups ranging from 1.97X to 15.07X as compared to both sequential execution and SLP alone.","PeriodicalId":92120,"journal":{"name":"Proceedings of the ... CGO : International Symposium on Code Generation and Optimization. International Symposium on Code Generation and Optimization","volume":"2 1","pages":"165-175"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86076481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A progressive register allocator for irregular architectures","authors":"D. Koes, S. Goldstein","doi":"10.1109/CGO.2005.4","DOIUrl":"https://doi.org/10.1109/CGO.2005.4","url":null,"abstract":"Register allocation is one of the most important optimizations a compiler performs. Conventional graph-coloring based register allocators are fast and do well on regular, RISC-like, architectures, but perform poorly on irregular, CISC-like, architectures with few registers and non-orthogonal instruction sets. At the other extreme, optimal register allocators based on integer linear programming are capable of fully modeling and exploiting the peculiarities of irregular architectures but do not scale well. We introduce the idea of a progressive allocator. A progressive allocator finds an initial allocation of quality comparable to a conventional allocator, but as more time is allowed for computation the quality of the allocation approaches optimal. This paper presents a progressive register allocator which uses a multi-commodity network flow model to elegantly represent the intricacies of irregular architectures. We evaluate our allocator as a substitute for gcc 's local register allocation pass.","PeriodicalId":92120,"journal":{"name":"Proceedings of the ... CGO : International Symposium on Code Generation and Optimization. International Symposium on Code Generation and Optimization","volume":"67 1","pages":"269-279"},"PeriodicalIF":0.0,"publicationDate":"2005-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86668196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}