Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction: Latest Publications

Cape: compiler-aided program transformation for HTM-based cache side-channel defense
Rui Zhang, Michael D. Bond, Yinqian Zhang
{"title":"Cape: compiler-aided program transformation for HTM-based cache side-channel defense","authors":"Rui Zhang, Michael D. Bond, Yinqian Zhang","doi":"10.1145/3497776.3517778","DOIUrl":"https://doi.org/10.1145/3497776.3517778","url":null,"abstract":"Cache side-channel attacks pose real threats to computer system security. Prior work called Cloak leverages commodity hardware transactional memory (HTM) to protect sensitive data and code from cache side-channel attacks. However, Cloak requires tedious and error-prone manual modifications to vulnerable software by programmers. This paper presents Cape, a compiler analysis and transformation that soundly and automatically protects programs from cache side-channel attacks using Cloak’s defense. An evaluation shows that Cape provides protection that is as strong as Cloak’s, while performing competitively with Cloak.","PeriodicalId":333281,"journal":{"name":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122356296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
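The Cloak defense that Cape automates can be pictured with a short sketch. The following is a hedged illustration, not Cape's generated code: the sensitive region runs inside an Intel RTM transaction whose working set is preloaded into cache, so any eviction (the event a cache side-channel attacker needs to cause or observe) aborts the transaction instead of leaking. `secret_table` and `use_secret` are hypothetical placeholders, and the snippet requires a TSX-capable CPU and `-mrtm`.

```cpp
#include <cstddef>
#include <immintrin.h>   // Intel RTM intrinsics; compile with -mrtm on a TSX-capable CPU

// Hypothetical sensitive data and computation (stand-ins for, e.g., AES T-tables).
static unsigned char secret_table[4096];

static void use_secret(unsigned char *out) {
    for (int i = 0; i < 16; ++i)
        out[i] = secret_table[out[i] * 16];   // data-dependent table lookups
}

void protected_region(unsigned char *out) {
    for (;;) {
        if (_xbegin() == _XBEGIN_STARTED) {
            // Preload the sensitive data into the transactional read set.
            for (std::size_t i = 0; i < sizeof(secret_table); i += 64)
                (void)*(volatile unsigned char *)&secret_table[i];
            use_secret(out);   // secret-dependent accesses now stay in cache
            _xend();           // commit: no eviction was observed during the region
            return;
        }
        // Transaction aborted (e.g., a cache line was evicted): retry or fall back.
    }
}
```

Cape's contribution is performing this transformation soundly and automatically in the compiler, rather than by hand as Cloak requires.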
Loner: utilizing the CPU vector datapath to process scalar integer data
Armand Behroozi, Sunghyun Park, S. Mahlke
{"title":"Loner: utilizing the CPU vector datapath to process scalar integer data","authors":"Armand Behroozi, Sunghyun Park, S. Mahlke","doi":"10.1145/3497776.3517767","DOIUrl":"https://doi.org/10.1145/3497776.3517767","url":null,"abstract":"Modern CPUs utilize SIMD vector instructions and hardware extensions to accelerate code with data-level parallelism. This allows for high performance gains in select application domains such as image and signal processing. However, general purpose code often lacks data-level parallelism or has complex control and data dependencies, which prevents vectorization. Thus, CPU vector registers and functional units frequently sit idle while the scalar datapath unilaterally executes code. In this paper, we present Loner, a profile-guided compiler methodology for optimizing scalar integer loops using the otherwise idle vector datapath. Loner expands the traditional definition of vectorization by identifying two situations where it is beneficial to perform vector operations with a single data element (\"Loner\" data). In the first, the scalar register file and functional units are overburdened, resulting in unnecessary spill/reload operations and stalls due to structural hazards. In the second, we describe a set of \"vector-amenable\" computation patterns that the vector pipeline naturally executes more efficiently than its scalar counterpart. Loner identifies hot code regions that exhibit either characteristic and offloads a subset of a program's computation graph to the vector datapath for maximum performance. We evaluate Loner on an x86 Whiskey Lake processor using select benchmarks from the SPEC, GAP, and MiBench benchmark suites where it improves performance by 2.64% (geomean) up to 40.28%.","PeriodicalId":333281,"journal":{"name":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129580198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
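As a rough illustration of what performing a vector operation on a single data element means, the sketch below (assumed x86/SSE2, not the compiler's actual output) moves one integer into a vector register, performs the add on the vector ALU, and moves the result back to the scalar side.

```cpp
#include <immintrin.h>   // SSE2 intrinsics

// Hedged sketch of the core idea: a single "loner" integer is processed on the
// otherwise idle vector datapath instead of the scalar one.
int add_on_vector_datapath(int a, int b) {
    __m128i va = _mm_cvtsi32_si128(a);   // scalar -> low lane of an XMM register
    __m128i vb = _mm_cvtsi32_si128(b);
    __m128i vr = _mm_add_epi32(va, vb);  // add executes on the vector pipeline
    return _mm_cvtsi128_si32(vr);        // low lane -> scalar
}
```

Loner applies this kind of offloading selectively, to hot regions where the scalar register file and functional units are overburdened or where the computation matches a vector-amenable pattern.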
QRANE: lifting QASM programs to an affine IR
Blake Gerard, T. Grosser, Martin Kong
{"title":"QRANE: lifting QASM programs to an affine IR","authors":"Blake Gerard, T. Grosser, Martin Kong","doi":"10.1145/3497776.3517775","DOIUrl":"https://doi.org/10.1145/3497776.3517775","url":null,"abstract":"This paper introduces QRANE, a tool that produces the affine intermediate representation (IR) from a quantum program expressed in Quantum Assembly language such as OpenQASM. QRANE finds subsets of quantum gates prescribed by the same operation type and linear relationships, and constructs a structured program representation expressed with polyhedral iteration domains and access relations, all while preserving the original semantics of the quantum program. We explore various policies for deciding amongst different delinearization strategies and discuss their effect on the quality of the reconstruction. Our evaluation demonstrates the high coverage and efficiency obtained with QRANE while enabling research on the benefits of affine transformations for large quantum circuits. Specifically, QRANE reconstructs affine iteration domains of up to 6 dimensions and up to 184 points per domain.","PeriodicalId":333281,"journal":{"name":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134549924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
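A drastically simplified, hypothetical sketch of the lifting step: a run of gates of the same operation type whose qubit operands are a linear function of the gate index can be summarized by an affine iteration domain. The `Gate` structure and the one-dimensional fit below are illustrative only; QRANE itself reconstructs domains of up to 6 dimensions and chooses among several delinearization strategies.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct Gate { const char *op; int qubit; };

// Returns true if the qubit operands in `run` fit q = a*i + b for gate index i.
static bool fits_affine(const std::vector<Gate> &run, int &a, int &b) {
    if (run.size() < 2) return false;
    a = run[1].qubit - run[0].qubit;
    b = run[0].qubit;
    for (std::size_t i = 0; i < run.size(); ++i)
        if (run[i].qubit != a * static_cast<int>(i) + b) return false;
    return true;
}

int main() {
    // h q[0]; h q[2]; h q[4]   lifts to   for i in [0,3): h q[2*i + 0]
    std::vector<Gate> run = {{"h", 0}, {"h", 2}, {"h", 4}};
    int a = 0, b = 0;
    if (fits_affine(run, a, b))
        std::printf("for i in [0,%zu): %s q[%d*i + %d]\n", run.size(), run[0].op, a, b);
    return 0;
}
```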
Mapping parallelism in a functional IR through constraint satisfaction: a case study on convolution for mobile GPUs
Naums Mogers, Lu Li, Valentin Radu, Christophe Dubach
{"title":"Mapping parallelism in a functional IR through constraint satisfaction: a case study on convolution for mobile GPUs","authors":"Naums Mogers, Lu Li, Valentin Radu, Christophe Dubach","doi":"10.1145/3497776.3517777","DOIUrl":"https://doi.org/10.1145/3497776.3517777","url":null,"abstract":"Graphics Processing Units (GPUs) are notoriously hard to optimize for manually. What is needed are good automatic code generators and optimizers. Accelerate, Futhark and Lift demonstrated that a functional approach is well suited for this challenge. Lift, for instance, uses a system of rewrite rules with a multi-stage approach. Algorithmic optimizations are first explored, followed by hardware-specific optimizations such as using shared memory and mapping parallelism. While the algorithmic exploration leads to correct transformed programs by construction, it is not necessarily true for the latter phase. Exploiting shared memory and mapping parallelism while ensuring correct synchronization is a delicate balancing act, and is hard to encode in a rewrite system. Currently, Lift relies on heuristics with ad-hoc mechanisms to check for correctness. Although this practical approach eventually produces high-performance code, it is not an ideal state of affairs. This paper proposes to extract parallelization constraints automatically from a functional IR and use a solver to identify valid rewriting. Using a convolutional neural network on a mobile GPU as a use case, this approach matches the performance of the ARM Compute Library GEMM convolution and the TVM-generated kernel consuming between 2.7x and 3.6x less memory on average. Furthermore, a speedup of 12x is achieved over the ARM Compute Library direct convolution implementation.","PeriodicalId":333281,"journal":{"name":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132807331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MLIR-based code generation for GPU tensor cores
Navdeep Katel, Vivek Khandelwal, Uday Bondhugula
{"title":"MLIR-based code generation for GPU tensor cores","authors":"Navdeep Katel, Vivek Khandelwal, Uday Bondhugula","doi":"10.1145/3497776.3517770","DOIUrl":"https://doi.org/10.1145/3497776.3517770","url":null,"abstract":"The state-of-the-art in high-performance deep learning today is primarily driven by manually developed libraries optimized and highly tuned by expert programmers using low-level abstractions with significant effort. This effort is often repeated for similar hardware and future ones. In this work, we pursue and evaluate the more modular and reusable approach of using compiler IR infrastructure to generate libraries by encoding all the required optimizations as a sequence of transformations and customized passes on an IR. We believe that until the recent introduction of MLIR (Multi-level intermediate representation), it had been hard to represent and transform computation at various levels of abstraction within a single IR. Using the MLIR infrastructure, we build a transformation and lowering pipeline to automatically generate near-peak performance code for matrix-matrix multiplication (matmul) as well as matmul fused with simple pointwise operators targeting tensor cores on NVIDIA GPUs. On a set of problem sizes ranging from 256 to 16384, our performance evaluation shows that we can obtain performance that is 0.95× to 1.19× and 0.80× to 1.60× of cuBLAS for FP32 and FP16 accumulate respectively on NVIDIA’s Ampere based Geforce 3090 RTX. Furthermore, by allowing the fusion of common pointwise operations with matrix-matrix multiplication, we obtain performance ranging from 0.95× to 1.67× of a cuBLAS-based implementation. Additionally, we present matmul-like examples such as 3-d contraction and batched matmul, which the pipeline can efficiently handle while providing competitive performance. We believe that these results motivate further research and engineering on automatic domain-specific library generation using compiler IR infrastructure for similar specialized accelerators.","PeriodicalId":333281,"journal":{"name":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117132573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Automating reinforcement learning architecture design for code optimization
Huanting Wang, Zhanyong Tang, Chen Zhang, Jiaqi Zhao, Chris Cummins, Hugh Leather, Z. Wang
{"title":"Automating reinforcement learning architecture design for code optimization","authors":"Huanting Wang, Zhanyong Tang, Chen Zhang, Jiaqi Zhao, Chris Cummins, Hugh Leather, Z. Wang","doi":"10.1145/3497776.3517769","DOIUrl":"https://doi.org/10.1145/3497776.3517769","url":null,"abstract":"Reinforcement learning (RL) is emerging as a powerful technique for solving complex code optimization tasks with an ample search space. While promising, existing solutions require a painstaking manual process to tune the right task-specific RL architecture, for which compiler developers need to determine the composition of the RL exploration algorithm, its supporting components like state, reward, and transition functions, and the hyperparameters of these models. This paper introduces SuperSonic, a new open-source framework to allow compiler developers to integrate RL into compilers easily, regardless of their RL expertise. SuperSonic supports customizable RL architecture compositions to target a wide range of optimization tasks. A key feature of SuperSonic is the use of deep RL and multi-task learning techniques to develop a meta-optimizer to automatically find and tune the right RL architecture from training benchmarks. The tuned RL can then be deployed to optimize new programs. We demonstrate the efficacy and generality of SuperSonic by applying it to four code optimization problems and comparing it against eight auto-tuning frameworks. Experimental results show that SuperSonic consistently improves hand-tuned methods by delivering better overall performance, accelerating the deployment-stage search by 1.75x on average (up to 100x).","PeriodicalId":333281,"journal":{"name":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","volume":"216 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131496587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Graph transformations for register-pressure-aware instruction scheduling
Ghassan Shobaki, Justin Bassett, Mark Heffernan, Austin Kerbow
{"title":"Graph transformations for register-pressure-aware instruction scheduling","authors":"Ghassan Shobaki, Justin Bassett, Mark Heffernan, Austin Kerbow","doi":"10.1145/3497776.3517771","DOIUrl":"https://doi.org/10.1145/3497776.3517771","url":null,"abstract":"This paper presents graph transformation algorithms for register-pressure-aware instruction scheduling. The proposed transformations add edges to the data dependence graph (DDG) to eliminate solutions that are either redundant or sub-optimal. Register-pressure-aware instruction scheduling aims at balancing two conflicting objectives: maximizing instruction-level parallelism (ILP) and minimizing register pressure (RP). Graph transformations have been previously proposed for the problem of maximizing ILP without considering RP, which is a problem of limited practical value. In the current paper, we extend that work by proposing graph transformations for the RP minimization objective, which is an important objective in practice. Various cost functions are considered for representing RP, and we show that the proposed transformations preserve optimality with respect to each of them. The proposed transformations are used to reduce the size of the solution space before applying a Branch-and-Bound (B&B) algorithm that exhaustively searches for an optimal solution. The proposed transformations and the B&B algorithm were implemented in the LLVM compiler, and their performance was evaluated experimentally on a CPU target and a GPU target. The SPEC CPU2017 floating-point benchmarks were used on the CPU and the PlaidML benchmarks were used on the GPU. The results show that the proposed transformations significantly reduce the compile time while giving approximately the same execution-time performance.","PeriodicalId":333281,"journal":{"name":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128703846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
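The effect of such transformations can be illustrated with a toy count of surviving schedules. The sketch below is not the paper's optimality-preservation test; it only shows that adding one DDG edge removes every schedule that orders the two instructions the other way, which is what shrinks the Branch-and-Bound search space. The paper's contribution is proving which added edges still preserve an optimal schedule under register-pressure cost functions.

```cpp
#include <iostream>
#include <utility>
#include <vector>

// Count the legal schedules (topological orders) of a tiny DDG by backtracking.
static long count_schedules(int n, const std::vector<std::pair<int, int>> &edges,
                            std::vector<bool> &done, int placed) {
    if (placed == n) return 1;
    long total = 0;
    for (int v = 0; v < n; ++v) {
        if (done[v]) continue;
        bool ready = true;
        for (const auto &e : edges)                 // all predecessors already placed?
            if (e.second == v && !done[e.first]) { ready = false; break; }
        if (!ready) continue;
        done[v] = true;
        total += count_schedules(n, edges, done, placed + 1);
        done[v] = false;
    }
    return total;
}

int main() {
    std::vector<std::pair<int, int>> ddg = {{0, 2}, {1, 2}};   // 0 and 1 are independent
    std::vector<bool> done(3, false);
    std::cout << count_schedules(3, ddg, done, 0) << "\n";     // 2 schedules
    ddg.push_back({0, 1});                                     // added edge: force 0 before 1
    std::cout << count_schedules(3, ddg, done, 0) << "\n";     // 1 schedule remains
}
```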
Memory access scheduling to reduce thread migrations
S. Damani, Prithayan Barua, Vivek Sarkar
{"title":"Memory access scheduling to reduce thread migrations","authors":"S. Damani, Prithayan Barua, Vivek Sarkar","doi":"10.1145/3497776.3517768","DOIUrl":"https://doi.org/10.1145/3497776.3517768","url":null,"abstract":"It has been widely observed that data movement is emerging as the primary bottleneck to scalability and energy efficiency in future hardware, especially for applications and algorithms that are not cache-friendly and achieve below 1% of peak performance on today’s systems. The idea of “moving compute to data” has been suggested as one approach to address this challenge. While there are approaches that can achieve this migration in software, hardware support is a promising direction from the perspectives of lower overheads and programmer productivity. Migratory thread architectures migrate lightweight hardware thread contexts to the location of the data instead of transferring data to the requesting processor. However, while transporting thread contexts is cheaper than moving data, thread migrations still incur energy and bandwidth overheads and can be particularly expensive if threads frequently migrate in a ping-pong manner between processors due to poor locality of access. In this paper, we propose Memory Access Scheduling, a new compiler optimization that aims to reduce the number of overall thread migrations when executing a program on migratory thread architectures. Our experiments show performance improvements with a geometric mean speedup of 1.23× for a set of 7 explicitly-parallelized kernels, and of 1.10× for a set of 15 automatically-parallelized kernels. We believe that memory access scheduling will also be an important optimization for other locality-centric architectures that benefit from software thread migrations, such as multi-threaded NUMA architectures.","PeriodicalId":333281,"journal":{"name":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127860959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
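A hedged sketch of the intuition (not the compiler pass itself): on a migratory-thread machine the hardware thread moves to the node that owns each accessed datum, so grouping independent accesses by owning node replaces a ping-pong pattern with one visit per node. `Access` and `home_node` are hypothetical stand-ins for information the compiler would derive.

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

struct Access { int id; int home_node; };

// Count how many times the thread must migrate to execute the accesses in order.
static int migrations(const std::vector<Access> &order) {
    int count = 0, current = -1;
    for (const Access &a : order)
        if (a.home_node != current) { ++count; current = a.home_node; }
    return count;
}

int main() {
    // Original order alternates between two nodes: 4 migrations.
    std::vector<Access> order = {{0, 0}, {1, 1}, {2, 0}, {3, 1}};
    std::cout << migrations(order) << "\n";   // 4

    // Assuming the four accesses are mutually independent, any reordering is
    // legal; grouping by owning node halves the number of migrations.
    std::stable_sort(order.begin(), order.end(),
                     [](const Access &x, const Access &y) { return x.home_node < y.home_node; });
    std::cout << migrations(order) << "\n";   // 2
}
```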
Caviar: an e-graph based TRS for automatic code optimization
Smail Kourta, Adel Namani, Fatima Benbouzid-Si Tayeb, K. Hazelwood, Chris Cummins, Hugh Leather, Riyadh Baghdadi
{"title":"Caviar: an e-graph based TRS for automatic code optimization","authors":"Smail Kourta, Adel Namani, Fatima Benbouzid-Si Tayeb, K. Hazelwood, Chris Cummins, Hugh Leather, Riyadh Baghdadi","doi":"10.1145/3497776.3517781","DOIUrl":"https://doi.org/10.1145/3497776.3517781","url":null,"abstract":"Term Rewriting Systems (TRSs) are used in compilers to simplify and prove expressions. State-of-the-art TRSs in compilers use a greedy algorithm that applies a set of rewriting rules in a predefined order (where some of the rules are not axiomatic). This leads to a loss of the ability to simplify certain expressions. E-graphs and equality saturation sidestep this issue by representing the different equivalent expressions in a compact manner from which the optimal expression can be extracted. While an e-graph-based TRS can be more powerful than a TRS that uses a greedy algorithm, it is slower because expressions may have a large or sometimes infinite number of equivalent expressions. Accelerating e-graph construction is crucial for making the use of e-graphs practical in compilers. In this paper, we present Caviar, an e-graph-based TRS for proving expressions within compilers. The main advantage of Caviar is its speed. It can prove expressions much faster than base e-graph TRSs. It relies on three techniques: 1) a technique that stops e-graphs from growing when the goal is reached, called Iteration Level Check; 2) a mechanism that balances exploration and exploitation in the equality saturation algorithm, called Pulsing Caviar; 3) a technique to stop e-graph construction before reaching saturation when a non-provable pattern is detected, called Non-Provable Patterns Detection (NPPD). We evaluate caviar on Halide, an optimizing compiler that relies on a greedy-algorithm-based TRS to simplify and prove its expressions. The proposed techniques allow Caviar to accelerate e-graph expansion for the task of proving expressions. They also allow Caviar to prove expressions that Halide’s TRS cannot prove while being only 0.68x slower. Caviar is publicly available at: https://github.com/caviar-trs/caviar.","PeriodicalId":333281,"journal":{"name":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124233029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
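The control flow behind the Iteration Level Check can be sketched without a real e-graph. The deliberately naive version below represents the set of expressions proved equal as plain strings (a real e-graph shares subterms and is far more compact); the point is only that the prover checks for the goal after every rewrite iteration and stops early, instead of saturating first and checking afterwards.

```cpp
#include <functional>
#include <iostream>
#include <set>
#include <string>
#include <vector>

using Rewrite = std::function<std::string(const std::string &)>;

// Grow the set of known-equal expressions one rewrite iteration at a time,
// returning as soon as the goal appears (the iteration-level check).
static bool prove(const std::string &start, const std::string &goal,
                  const std::vector<Rewrite> &rules, int max_iters) {
    std::set<std::string> known = {start};
    for (int i = 0; i < max_iters; ++i) {
        if (known.count(goal)) return true;          // goal reached: stop growing
        std::set<std::string> next = known;
        for (const auto &e : known)
            for (const auto &r : rules) next.insert(r(e));
        if (next == known) break;                    // saturated without reaching the goal
        known = std::move(next);
    }
    return known.count(goal) > 0;
}

int main() {
    // One toy axiom: an expression containing " + 0" rewrites to the same
    // expression with that subterm removed.
    std::vector<Rewrite> rules = {[](const std::string &e) {
        std::string out = e;
        auto pos = out.find(" + 0");
        if (pos != std::string::npos) out.erase(pos, 4);
        return out;
    }};
    std::cout << prove("(x + 0)", "(x)", rules, 10) << "\n";   // prints 1
}
```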
QSSA: an SSA-based IR for Quantum computing
Anurudh Peduri, Siddharth Bhat, T. Grosser
{"title":"QSSA: an SSA-based IR for Quantum computing","authors":"Anurudh Peduri, Siddharth Bhat, T. Grosser","doi":"10.1145/3497776.3517772","DOIUrl":"https://doi.org/10.1145/3497776.3517772","url":null,"abstract":"Quantum computing hardware has progressed rapidly. Simultaneously, there has been a proliferation of programming languages and program optimization tools for quantum computing. Existing quantum compilers use intermediate representations (IRs) where quantum programs are described as circuits. Such IRs fail to leverage existing work on compiler optimizations. In such IRs, it is non-trivial to statically check for physical constraints such as the no-cloning theorem, which states that qubits cannot be copied. We introduce QSSA, a novel quantum IR based on static single assignment (SSA) that enables decades of research in compiler optimizations to be applied to quantum compilation. QSSA models quantum operations as being side-effect-free. The inputs and outputs of the operation are in one-to-one correspondence; qubits cannot be created or destroyed. As a result, our IR supports a static analysis pass that verifies no-cloning at compile-time. The quantum circuit is fully encoded within the def-use chain of the IR, allowing us to leverage existing optimization passes on SSA representations such as redundancy elimination and dead-code elimination. Running our QSSA-based compiler on the QASMBench and IBM Quantum Challenge datasets, we show that our optimizations perform comparably to IBM’s Qiskit quantum compiler infrastructure. QSSA allows us to represent, analyze, and transform quantum programs using the robust theory of SSA representations, bringing quantum compilation into the realm of well-understood theory and practice.","PeriodicalId":333281,"journal":{"name":"Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122486393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
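A minimal sketch of the compile-time no-cloning check that QSSA's value semantics enable, using a hypothetical, simplified IR (not QSSA's actual encoding): because every gate consumes its qubit operands and returns fresh result values, it suffices to verify that each qubit-typed SSA value is used at most once.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// One operation in a toy SSA IR: result values and operand values, all qubits.
struct Op { std::vector<std::string> results, operands; };

// No-cloning holds if no qubit value is consumed by more than one operation.
static bool verify_no_cloning(const std::vector<Op> &ops) {
    std::map<std::string, int> uses;
    for (const Op &op : ops)
        for (const std::string &v : op.operands)
            if (++uses[v] > 1) return false;   // a qubit value was used twice
    return true;
}

int main() {
    // %q1 = h %q0 ; %q2, %q3 = cx %q1, %q0   <- %q0 is consumed twice: rejected
    std::vector<Op> bad = {{{"%q1"}, {"%q0"}},
                           {{"%q2", "%q3"}, {"%q1", "%q0"}}};
    std::cout << (verify_no_cloning(bad) ? "ok" : "no-cloning violated") << "\n";
}
```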