2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)最新文献

Enabling Fine-Grained Incremental Builds by Making Compiler Stateful 让编译器有状态，实现细粒度增量编译

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) Pub Date : 2024-03-02 DOI: 10.1109/CGO57630.2024.10444865

Ruobing Han, Jisheng Zhao, Hyesoon Kim

{"title":"Enabling Fine-Grained Incremental Builds by Making Compiler Stateful","authors":"Ruobing Han, Jisheng Zhao, Hyesoon Kim","doi":"10.1109/CGO57630.2024.10444865","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444865","url":null,"abstract":"Incremental builds are commonly employed in software development, involving minor changes to existing source code that is then frequently recompiled. Speeding up incremental builds not only enhances the software development workflow but also improves CI/CD systems by enabling faster verification steps. Current solutions for incremental builds primarily rely on build systems that analyze file dependencies to avoid unnecessary recompilation of unchanged files. However, for the files that do undergo changes, these build systems simply invoke compilers to recompile them from scratch. This approach reveals a fundamental asymmetry in the system: while build systems operate in a stateful manner, compilers are stateless. As a result, incremental builds are applied only at a coarse-grained level, focusing on entire source files, rather than at a more fine-grained level that considers individual code sections. In this paper, we propose an innovative approach for enabling the fine-grained incremental build by introducing statefulness into compilers. Under this paradigm, the compiler leverages its profiling history to expedite the compilation process of modified source files, thereby reducing overall build time. Specifically, the stateful compiler retains dormant information of compiler passes executed in previous builds and uses this data to bypass dormant passes during subsequent incremental compilations. We also outline the essential changes needed to transform conventional stateless compilers into stateful ones. For practical evaluation, we modify the Clang compiler to adopt a stateful architecture and evaluate its performance on real-world C++ projects. Our comparative study indicates that the stateful version outperforms the standard Clang compiler in incremental builds, accelerating the end-to-end build process by an average of 6.72%.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"55 10","pages":"221-232"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation 利用上下文敏感性和伪仪器改造基于采样的 PGO

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) Pub Date : 2024-03-02 DOI: 10.1109/CGO57630.2024.10444807

Wenlei He, Hongtao Yu, Lei Wang, Taewook Oh

{"title":"Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation","authors":"Wenlei He, Hongtao Yu, Lei Wang, Taewook Oh","doi":"10.1109/CGO57630.2024.10444807","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444807","url":null,"abstract":"The ever increasing scale of modern data center demands more effective optimizations, as even a small percentage of performance improvement can result in a significant reduction in data-center cost and its environmental footprint. However, the diverse set of workloads running in data centers also challenges the scalability of optimization solutions. Profile-guided optimization (PGO) is a promising technique to improve application performance. Sampling-based PGO is widely used in data-center applications due to its low operational overhead, but the performance gains are not as substantial as the instrumentation-based counterpart. The high operational overhead of instrumentation-based PGO, on the other hand, hinders its large-scale adoption, despite its superior performance gains. In this paper, we propose CSSPGO, a context-sensitive sampling-based PGO framework with pseudo-instrumentation. CSSPGO offers a more balanced solution to push sampling-based PGO performance closer to instrumentation-based PGO while maintaining minimal operational overhead. It leverages pseudo-instrumentation to improve profile quality without incurring the overhead of traditional instrumentation. It also enriches profile with context-sensitivity to aid more effective optimizations through a novel profiling methodology using synchronized LBR and stack sampling. CSSPGO is now used to optimize over 75% of Meta's data center CPU cycles. Our evaluation with production workloads demonstrates 1%-5% performance improvement on top of state-of-the-art sampling-based PGO.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"63 4","pages":"322-333"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A Tensor Algebra Compiler for Sparse Differentiation 用于稀疏微分的张量代数编译器

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) Pub Date : 2024-03-02 DOI: 10.1109/CGO57630.2024.10444787

Amir Shaikhha, Mathieu Huot, Shideh Hashemian

引用次数: 0

One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution 一个自动机统治一切：超越多个正则表达式的执行

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) Pub Date : 2024-03-02 DOI: 10.1109/CGO57630.2024.10444810

L. Cicolini, F. Carloni, Marco D. Santambrogio, Davide Conficconi

{"title":"One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution","authors":"L. Cicolini, F. Carloni, Marco D. Santambrogio, Davide Conficconi","doi":"10.1109/CGO57630.2024.10444810","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444810","url":null,"abstract":"Regular Expressions (REs) matching is crucial to identify strings exhibiting certain morphological properties in a data stream, resulting paramount in contexts such as deep packet inspection in computer security and genome analysis in bioinformatics. Yet, due to their intrinsic data-dependence characteristics, REs represent a complex computational kernel, and numerous solutions investigate pattern-matching efficiency in different directions. However, most of them lack a comprehensive ruleset optimization approach to truly push the pattern matching performance when considering multiple REs together. Thus, exploiting REs morphological similarities within the same dataset allows memory reduction when storing the patterns and drastically improves the dataset-matching throughput. Based on this observation, we propose the Multi-RE Finite State Automata (MFSA) that extends the Finite State Automata (FSA) model to improve REs parallelization by leveraging similarities within a specific application ruleset. We design a multi-level compilation framework to manage REs merging and optimization to produce MFSA(s). Furthermore, we extend iNFAnt algorithm for MFSAs execution with the novel iMFAnt engine. Our evaluation investigates the MFSA size-reduction impact and the execution throughput compared with the one of multiple FSA in both single-and multi-threaded configurations. This approach shows an average 71.95% compression in terms of states, introducing limited compilation time overhead. Besides, best iMFAnt achieves a geomean $5.99times$ throughput improvement and $4.05times$ speedup against single and multiple parallel FSAs.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"65 6","pages":"193-206"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

TapeFlow: Streaming Gradient Tapes in Automatic Differentiation TapeFlow：自动分辨中的梯度磁带流

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) Pub Date : 2024-03-02 DOI: 10.1109/CGO57630.2024.10444805

Milad Hakimi, Arrvindh Shriraman

{"title":"TapeFlow: Streaming Gradient Tapes in Automatic Differentiation","authors":"Milad Hakimi, Arrvindh Shriraman","doi":"10.1109/CGO57630.2024.10444805","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444805","url":null,"abstract":"Computing gradients is a crucial task in many domains, including machine learning, physics simulations, and scientific computing. Automatic differentiation (AD) computes gradients for arbitrary imperative code. In reverse mode AD, an auxiliary structure, the tape, is used to transfer intermediary values required for gradient computation. The challenge is how to organize the tape in the memory hierarchy since it has a high reuse distance, lacks temporal locality, and inflates working set by 2-4×. We introduce Tapeflow, a compiler framework to orchestrate and manage the gradient tape. We make three key contributions. i) We introduce the concept of regions, which transforms the tape layout into an array-of-structs format to improve spatial reuse. ii) We schedule the execution into layers and explicitly orchestrate the tape operands using a scratchpad. This reduces the required cache size and on-chip energy. iii) Finally, we stream the tape from the DRAM by organizing it into a FIFO of tiles. The tape operands arrive just-in-time for each layer. Tapeflow, running on the same hardware, outperforms Enzyme, the state-of-the-art compiler, by 1.3-2.5×, reduces on-chip SRAM usage by 5–40 ×, and saves 8× on-chip energy. We demonstrate Tapeflow on a wide range of algorithms written in general-purpose language.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"61 9","pages":"81-92"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Enhancing Performance Through Control-Flow Unmerging and Loop Unrolling on GPUs 在 GPU 上通过控制流拆分和循环解卷提升性能

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) Pub Date : 2024-03-02 DOI: 10.1109/CGO57630.2024.10444819

Alnis Murtovi, G. Georgakoudis, K. Parasyris, Chunhua Liao, Ignacio Laguna, Bernhard Steffen

{"title":"Enhancing Performance Through Control-Flow Unmerging and Loop Unrolling on GPUs","authors":"Alnis Murtovi, G. Georgakoudis, K. Parasyris, Chunhua Liao, Ignacio Laguna, Bernhard Steffen","doi":"10.1109/CGO57630.2024.10444819","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444819","url":null,"abstract":"Compilers use a wide range of advanced optimizations to improve the quality of the machine code they generate. In most cases, compiler optimizations rely on precise analyses to be able to perform the optimizations. However, whenever a control-flow merge is performed information is lost as it is not possible to precisely reason about the program anymore. One existing solution to this issue is code duplication, which involves duplicating instructions from merge blocks to their predecessors. This paper introduces a novel and more aggressive approach to code duplication, grounded in loop unrolling and control-flow unmerging that enables subsequent optimizations that cannot be enabled by applying only one of these transformations. We implemented our approach inside LLVM, and evaluated its performance on a collection of GPU benchmarks in CUDA. Our results demonstrate that, even when faced with branch divergence, which complicates code duplication across multiple branches and increases the associated cost, our optimization technique achieves performance improvements of up to 81%.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"56 12","pages":"106-118"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

EasyTracker: A Python Library for Controlling and Inspecting Program Execution EasyTracker：用于控制和检查程序执行的 Python 库

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) Pub Date : 2024-03-02 DOI: 10.1109/CGO57630.2024.10444823

Théo Barollet, C. Guillon, Manuel Selva, François Broquedis, Florent Bouchez-Tichadou, Fabrice Rastello

{"title":"EasyTracker: A Python Library for Controlling and Inspecting Program Execution","authors":"Théo Barollet, C. Guillon, Manuel Selva, François Broquedis, Florent Bouchez-Tichadou, Fabrice Rastello","doi":"10.1109/CGO57630.2024.10444823","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444823","url":null,"abstract":"Learning to program involves building a mental representation of how a machine executes instructions and stores data in memory. To help students, teachers often use visual representations to illustrate the execution of programs or particular concepts in their lectures. As a famous example, teachers often represent references/pointers with arrows pointing to objects or memory locations. While these visual representations are mostly hand-drawn, there is a tendency to supplement them with tools. However, building such a tool from scratch requires much effort and a high level of debugging technical expertise, while existing tools are difficult to adapt to different contexts. This article presents EasyTracker, a Python library targeting teachers who are not debugging experts. By providing ways of controlling the execution and inspecting the state of programs, EasyTracker simplifies the development of tools that generate tuned visual representations from the controlled execution of a program. The controlled program can be written either in Python, C, or assembly languages. To ease the development of visualization tools working for programs in different languages and to allow the building of web-based tools, EasyTracker provides a language-agnostic and serializable representation of the state of a running program.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"62 12","pages":"359-372"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications DrPy：准确定位多层 Python 应用程序中的低效内存使用情况

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) Pub Date : 2024-03-02 DOI: 10.1109/CGO57630.2024.10444862

Jinku Cui, Qidong Zhao, Yueming Hao, Xu Liu

{"title":"DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications","authors":"Jinku Cui, Qidong Zhao, Yueming Hao, Xu Liu","doi":"10.1109/CGO57630.2024.10444862","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444862","url":null,"abstract":"Python has become an increasingly popular programming language, especially in the areas of data analytics and machine learning. Many modern Python packages employ a multi-layer design: the Python layer manages various packages and expresses high-level algorithms; the native layer is written in C/C++/Fortran/CUDA for efficient computation. Typically, each layer manages its own computation and memory and exposes APIs for cross-layer interactions. Without holistic optimization, performance inefficiencies can exist at the boundary between layers. In this paper, we develop DrPy, a novel profiler that pinpoints such memory inefficiencies across layers in Python applications. Unlike existing tools, DrPy takes a hybrid and fine-grained approach to track memory objects and their usage in both Python and native layers. DrPy correlates the behavior of memory objects across layers and builds an object flow graph to pinpoint memory inefficiencies. In addition, DrPy captures rich information associated with object flow graphs, such as call paths and source code attribution to guide intuitive code optimization. Guided by DrPy, we are able to optimize many Python applications with non-trivial performance improvement. Many optimization patches have been validated by application developers and committed to application repositories.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"61 7","pages":"245-257"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis OptiWISE：将采样与仪器相结合，进行粒度 CPI 分析

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) Pub Date : 2024-03-02 DOI: 10.1109/CGO57630.2024.10444771

Yuxin Guo, Alex W. Chadwick, Márton Erdős, Utpal Bora, Ilias Vougioukas, Giacomo Gabrielli, Timothy M. Jones

{"title":"OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis","authors":"Yuxin Guo, Alex W. Chadwick, Márton Erdős, Utpal Bora, Ilias Vougioukas, Giacomo Gabrielli, Timothy M. Jones","doi":"10.1109/CGO57630.2024.10444771","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444771","url":null,"abstract":"Despite decades of improvement in compiler technology, it remains necessary to profile applications to improve performance. Existing profiling tools typically either sample hardware performance counters or instrument the program with extra instructions to analyze its execution. Both techniques are valuable with different strengths and weaknesses, but do not always correctly identify optimization opportunities. We present OPTIWISE, a profiling tool that runs the program twice, once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OPTIWISE then combines this information to give a highly detailed per-instruction CPI metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. We evaluate OPTIWISE to show it has an overhead of 8.1× geomean, and 57× worst case on SPEC CPU2017 benchmarks. Using OPTIWISE, we present case studies of optimizing selected SPEC benchmarks on a modern x86 server processor. The per-instruction CPI metrics quickly reveal problems such as costly mispredicted branches and cache misses, which we use to manually optimize for effective performance improvements.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"57 9","pages":"373-385"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140398681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Energy-Aware Tile Size Selection for Affine Programs on GPUs GPU 上 Affine 程序的能量感知瓦片尺寸选择

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) Pub Date : 2024-03-02 DOI: 10.1109/CGO57630.2024.10444795

Malith Jayaweera, Martin Kong, Yanzhi Wang, D. Kaeli

引用次数: 0