2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO): Latest Publications

ELFies: Executable Region Checkpoints for Performance Analysis and Simulation
Pub Date: 2021-02-27 DOI: 10.1109/CGO51591.2021.9370340
H. Patil, Alexander Isaev, W. Heirman, Alen Sabu, Ali Hajiabadi, Trevor E. Carlson
Abstract: We address the challenge faced in characterizing long-running workloads, namely how to reliably focus detailed analysis on interesting execution regions. We present a set of tools that allows users to precisely capture any region of interest in program execution and create a stand-alone executable, called an ELFie, from it. An ELFie starts with the same program state captured at the beginning of the region of interest and then executes natively. With ELFies, no fast-forwarding to the region of interest is needed, and there is no uncertainty about reaching the region. ELFies can be fed to dynamic program-analysis tools or simulators that work with regular program binaries. Our tool-chain is based on the PinPlay framework and requires no special hardware, operating-system changes, recompilation, or re-linking of test programs. This paper describes the design of our ELFie generation tool-chain and the application of ELFies in the performance analysis and simulation of regions of interest in popular long-running single- and multi-threaded benchmarks.
Citations: 9
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation
Pub Date: 2021-02-27 DOI: 10.1109/CGO51591.2021.9370308
Chris Lattner, M. Amini, Uday Bondhugula, Albert Cohen, Andy Davis, J. Pienaar, River Riddle, T. Shpeisman, Nicolas Vasilache, O. Zinenko
Abstract: This work presents MLIR, a novel approach to building reusable and extensible compiler infrastructure. MLIR addresses software fragmentation, compilation for heterogeneous hardware, significantly reducing the cost of building domain-specific compilers, and connecting existing compilers together. MLIR facilitates the design and implementation of code generators, translators, and optimizers at different levels of abstraction and across application domains, hardware targets, and execution environments. The contributions of this work include (1) a discussion of MLIR as a research artifact, built for extension and evolution, identifying the challenges and opportunities posed by this novel design in its semantics, optimization specification, system architecture, and engineering; and (2) an evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers, describing diverse use cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture. The paper also presents the rationale for MLIR, its original design principles, structures, and semantics.
Citations: 190
BuildIt: A Type-Based Multi-stage Programming Framework for Code Generation in C++
Pub Date: 2021-02-27 DOI: 10.1109/CGO51591.2021.9370333
Ajay Brahmakshatriya, Saman P. Amarasinghe
Abstract: The simplest implementation of a domain-specific language is to embed it in an existing language using operator overloading. This way, the DSL can inherit parsing, syntax and type checking, error handling, and the toolchain of debuggers and IDEs from the host language. A natural host-language choice for most high-performance DSLs is the de facto high-performance language, C++. However, DSL designers quickly run into the problem of not being able to extract control flow, due to the lack of introspection in C++, and have to resort to special functions with lambdas to represent loops and conditionals. This approach introduces unnecessary syntax and does not capture the side effects of updates inside the lambdas in a safe way. We present BuildIt, a type-based multi-stage execution framework that solves this problem by extracting all control-flow operators, such as if-then-else conditionals and for and while loops, using a pure library approach. BuildIt achieves this by repeated execution of the program to explore all control-flow paths and construct the AST piece by piece. We show that BuildIt can do this without exponential blow-up in output size or execution time. We apply BuildIt's staging capabilities to the state-of-the-art tensor compiler TACO to generate low-level IR for custom level formats. Thus, BuildIt offers a way to get both generalization and programmability for the user while generating specialized and efficient code. We also demonstrate that BuildIt can generate rich control flow from relatively simple code by using it to stage an interpreter for an esoteric language. BuildIt changes the way we think about multi-staging as a problem by reducing a supposedly harder problem, one thought to require features like introspection or specialized compiler support, to a set of common features found in most languages.
Citations: 6
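The repeated-execution strategy the BuildIt abstract describes can be sketched with a toy path explorer. This is hypothetical illustration code, not the BuildIt C++ API: it stages only conditionals (BuildIt also extracts loops), and all names here are ours. The program is re-run once per control-flow path, with earlier branch decisions replayed and the first unforced decision defaulting to one side, so that every path contributes its trace to the generated AST.

```python
# Toy model of control-flow extraction by repeated execution
# (hypothetical sketch; not the actual BuildIt API).

def explore(staged_fn):
    """Run staged_fn once per control-flow path; return all traces."""
    all_traces = []
    pending = [[]]                     # decision prefixes left to explore
    while pending:
        prefix = pending.pop()
        taken = []                     # branch decisions made this run
        trace = []                     # the "AST piece" built this run
        replay = iter(prefix)

        def cond(name):
            try:
                choice = next(replay)  # replay forced decisions first
            except StopIteration:
                choice = True          # default branch; flipped later
            taken.append(choice)
            trace.append(f"if({name})={choice}")
            return choice

        staged_fn(cond, trace.append)
        all_traces.append(trace)
        # Schedule the unexplored False side of each new decision.
        for i in range(len(prefix), len(taken)):
            pending.append(taken[:i] + [False])
    return all_traces

def program(cond, emit):
    """A 'staged' program: cond() is the library's staged conditional."""
    emit("start")
    if cond("x > 0"):
        emit("pos")
    else:
        emit("neg")
    emit("end")

traces = explore(program)              # two paths: x > 0 true / false
```

Note that the number of re-executions is one per path, and BuildIt's contribution includes showing how to avoid exponential blow-up in output size for the generated code.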
Variable-Sized Blocks for Locality-Aware SpMV
Pub Date: 2021-02-27 DOI: 10.1109/CGO51591.2021.9370327
Naveen Namashivavam, Sanyam Mehta, P. Yew
Abstract: Blocking is an important optimization available to mitigate data-movement overhead and improve temporal locality in SpMV, a sparse BLAS kernel with an irregular memory-reference pattern. In this work, we propose an analytical model to determine the effective block size for highly irregular sparse matrices by factoring in the distribution of non-zeros in the sparse dataset. As a result, the blocks generated by our scheme are variable-sized, as opposed to the constant-sized blocks of most existing SpMV algorithms. We demonstrate our blocking scheme using Compressed Vector Blocks (CVB), a new column-based blocked data format, on an Intel Xeon Skylake-X multicore processor. We evaluated the performance of CVB-based SpMV with variable-sized blocks using an extensive set of matrices from the Stanford Network Analysis Platform (SNAP). Our evaluation shows a speedup of up to 2.62X (1.73X on average) and 2.02X (1.18X on average) over the highly tuned vendor SpMV implementation in Intel's Math Kernel Library (MKL) on single and multiple Intel Xeon cores, respectively.
Citations: 6
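As background for the entry above, the baseline kernel and the blocking idea can be written out in a few lines. This is an illustrative sketch only: `spmv_csr` is the standard CSR kernel, and `variable_blocks` is our naive stand-in for "variable-sized blocks", partitioning columns so each block holds roughly equal non-zeros; the paper's CVB format and analytical model are more involved.

```python
def spmv_csr(indptr, indices, data, x):
    """y = A @ x for a matrix A in CSR form (indptr/indices/data)."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        s = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            s += data[k] * x[indices[k]]   # irregular access into x
        y[row] = s
    return y

def variable_blocks(indices, ncols, nnz_per_block):
    """Split the column range into variable-width blocks holding
    roughly nnz_per_block non-zeros each (the spirit of variable-sized
    blocking: dense column regions get narrow blocks)."""
    counts = [0] * ncols
    for j in indices:
        counts[j] += 1
    blocks, start, acc = [], 0, 0
    for j in range(ncols):
        acc += counts[j]
        if acc >= nnz_per_block:
            blocks.append((start, j + 1))  # half-open column range
            start, acc = j + 1, 0
    if start < ncols:
        blocks.append((start, ncols))
    return blocks

# A = [[1, 2], [0, 3]] in CSR; y = A @ [1, 1] = [3, 3]
y = spmv_csr([0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [1.0, 1.0])
```

Blocking by non-zero count rather than by a fixed width keeps the working set of `x` touched per block comparable even when the non-zero distribution is skewed.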
Compiling Graph Applications for GPUs with GraphIt
Pub Date: 2021-02-27 DOI: 10.1109/CGO51591.2021.9370321
Ajay Brahmakshatriya, Saman P. Amarasinghe
Abstract: The performance of graph programs depends highly on the algorithm, the size and structure of the input graphs, and the features of the underlying hardware. No single set of optimizations or hardware platform works well across all settings. To achieve high performance, the programmer must carefully select which set of optimizations and which hardware platform to use. The GraphIt programming language makes it easy for the programmer to write an algorithm once and optimize it for different inputs using a scheduling language. However, GraphIt currently has no support for generating high-performance code for GPUs; programmers must resort to re-implementing the entire algorithm from scratch in a low-level language, with an entirely different set of abstractions and optimizations, to achieve high performance on GPUs. We propose G2, an extension to the GraphIt compiler framework that achieves high performance on both CPUs and GPUs using the same algorithm specification. G2 significantly expands the optimization space of GPU graph-processing frameworks with a novel GPU scheduling language and compiler that enables combining load balancing, edge-traversal direction, active-vertexset creation, active-vertexset processing ordering, and kernel-fusion optimizations. G2 also introduces two performance optimizations, Edge-based Thread Warps CTAs load balancing (ETWC) and EdgeBlocking, to expand the optimization space for GPUs. ETWC improves load balancing by dynamically partitioning the edges of each vertex into blocks that are assigned to threads, warps, and CTAs for execution. EdgeBlocking improves the locality of the program by reordering the edges and restricting random memory accesses to fit within the L2 cache. We evaluate G2 on 5 algorithms and 9 input graphs on both Pascal and Volta generation NVIDIA GPUs, and show that it achieves up to 5.11× speedup over state-of-the-art GPU graph-processing frameworks and is the fastest on 66 out of the 90 experiments.
Citations: 10
Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation
Pub Date: 2021-02-27 DOI: 10.1109/CGO51591.2021.9370312
Ziyi Zhao, Zhang Jiang, Ying Chen, Xiaoli Gong, Wenwen Wang, P. Yew
Abstract: Dynamic Binary Translation (DBT) is a key enabler for cross-ISA emulation, system virtualization, runtime instrumentation, and many other important applications. Among several critical requirements for DBT is providing equivalent semantics for atomic synchronization instructions such as Load-Link/Store-Conditional (LL/SC), found mostly in reduced instruction set architectures (RISC), and Compare-and-Swap (CAS), found mostly in complex instruction set architectures (CISC). However, state-of-the-art DBT tools often do not provide a fully correct translation of these atomic instructions, in particular from RISC atomic instructions (i.e., LL/SC) to CISC atomic instructions (i.e., CAS), due to performance concerns. As a result, some may trigger the well-known ABA problem, which can lead to wrong results or program crashes. In our experimental studies on QEMU, a state-of-the-art DBT tool, running multi-threaded lock-free stack operations implemented with the ARM instruction set (i.e., using LL/SC) on Intel x86 platforms (i.e., using CAS), the program often crashes within 2 seconds. Although attempts have been made to provide correct emulation for such atomic instructions, they either incur heavy execution overheads or require additional hardware support. In this paper, we propose several schemes to address these issues and implement them in QEMU to evaluate their performance overheads. The results show that all of the proposed schemes provide correct emulation, and the best solution achieves a minimum, maximum, and geomean speedup of 1.25x, 3.21x, and 2.03x, respectively, over the best existing software-based scheme.
Citations: 5
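The ABA hazard at the heart of the entry above is easy to demonstrate. Below is a single-threaded toy sketch (not QEMU's actual translation; the "threads" are hand-interleaved): a value-comparing CAS cannot tell that a memory cell was modified and then restored, whereas a store-conditional would fail because the location was written after the load-link.

```python
# Single-threaded illustration of the ABA problem that arises when
# LL/SC atomics are emulated with a value-comparing CAS.

class Cell:
    """A shared memory cell with a simulated compare-and-swap."""
    def __init__(self, value):
        self.value = value

    def cas(self, expected, new):
        # CAS compares only values; it cannot detect that the cell
        # was changed and changed back in the meantime.
        if self.value == expected:
            self.value = new
            return True
        return False

top = Cell("A")                 # top of a lock-free stack
seen = top.value                # "thread 1" reads the top: "A"

# "Thread 2" runs in between: it pops "A", pushes "B", then pushes a
# different node whose value happens to compare equal to "A" again.
top.cas("A", "B")
top.cas("B", "A")

# Thread 1 resumes. Its CAS succeeds even though the stack changed
# underneath it; a store-conditional would have failed here, because
# the location was written since the load-link.
aba_succeeded = top.cas(seen, "C")
```

This is why a naive LL/SC-to-CAS translation can corrupt lock-free data structures: correctness requires detecting intervening writes, not just matching values.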
ANGHABENCH: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction
Pub Date: 2021-02-27 DOI: 10.1109/CGO51591.2021.9370322
A. F. Silva, Jerônimo Nunes Rocha, B. Guimarães, Fernando Magno Quintão Pereira
Abstract: A predictive compiler uses properties of a program to decide how to optimize it. The compiler is trained on a collection of programs to derive a model that determines its actions in the face of unknown code. One of the challenges of predictive compilation is finding good training sets. Regardless of the programming language, the availability of human-made benchmarks is limited. Moreover, current synthesizers produce code that is very different from actual programs, and mining compilable code from open repositories is difficult due to program dependencies. In this paper, we use a combination of web crawling and type inference to overcome these problems for the C programming language. We use a type reconstructor based on Hindley-Milner's algorithm to produce ANGHABENCH, a virtually unlimited collection of real-world compilable C programs. Although ANGHABENCH programs are not executable, they can be transformed into object files by any C-compliant compiler; therefore, they can be used to train compilers for code-size reduction. We have used thousands of ANGHABENCH programs to train YACOS, a predictive compiler based on LLVM. The version of YACOS autotuned with ANGHABENCH generates binaries for the LLVM test suite over 10% smaller than clang -Oz. It compresses code impervious even to the state-of-the-art Function Sequence Alignment technique published in 2019, as it does not require large binaries to work well.
Citations: 37
CGO 2021 Organization
Pub Date: 2021-02-27 DOI: 10.1109/cgo51591.2021.9370318
Citations: 0
C-for-Metal: High Performance SIMD Programming on Intel GPUs
Pub Date: 2021-01-26 DOI: 10.1109/CGO51591.2021.9370324
Guei-Yuan Lueh, Kaiyu Chen, Gang Chen, J. Fuentes, Weiyu Chen, Fangwen Fu, Hong Jiang, Hongzheng Li, Daniel Rhee
Abstract: The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by the compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications, as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap we introduce C-for-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface for exploiting the underlying hardware features, allowing fine-grained register management, SIMD size control, and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving up to 2.7x speedup on the latest Intel GPU.
Citations: 2
UNIT: Unifying Tensorized Instruction Compilation
Pub Date: 2021-01-21 DOI: 10.1109/CGO51591.2021.9370330
Jian Weng, Animesh Jain, Jie Wang, Leyuan Wang, Yida Wang, Tony Nowatzki
Abstract: Because of the increasing demand for intensive computation in deep neural networks, researchers have developed both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed-precision data types. However, it is hard to benefit from mixed precision without hardware specialization because of the overhead of data casting. Recently, hardware vendors have offered tensorized instructions specialized for mixed-precision tensor operations, such as Intel VNNI, Nvidia Tensor Core, and ARM DOT. These instructions involve a new computing idiom, which reduces multiple low-precision elements into one high-precision element. The lack of compilation techniques for this emerging idiom makes it hard to utilize these instructions. In practice, one approach is to use vendor-provided libraries for computationally intensive kernels, but this is inflexible and prevents further optimizations. Another approach is to manually write hardware intrinsics, which is error-prone and difficult for programmers. Some prior works tried to address this problem by creating a compiler for each instruction, which requires excessive effort when many tensorized instructions are involved. In this work, we develop a compiler framework, UNIT, to unify the compilation for tensorized instructions. The key to this approach is a unified semantics abstraction, which makes the integration of new instructions easy and the reuse of analyses and transformations possible. Tensorized instructions from different platforms can be compiled via UNIT with moderate effort for favorable performance. Given a tensorized instruction and a tensor operation, UNIT automatically detects the applicability of the instruction, transforms the loop organization of the operation, and rewrites the loop body to take advantage of the tensorized instruction. According to our evaluation, UNIT is able to target various mainstream hardware platforms. The generated end-to-end inference model achieves 1.3x speedup over Intel oneDNN on an x86 CPU, 1.75x speedup over Nvidia cuDNN on an Nvidia GPU, and 1.13x speedup over a carefully tuned TVM solution for ARM DOT on an ARM CPU.
Citations: 21
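The "new computing idiom" named in the UNIT abstract, reducing multiple low-precision elements into one high-precision element, can be modeled in scalar code. This is an illustrative sketch of what instructions such as Intel VNNI's vpdpbusd or ARM DOT compute per lane; the function names and helpers are ours, not UNIT's API.

```python
# Scalar model of the mixed-precision reduction idiom behind
# tensorized instructions (illustrative only).

def dot4_accumulate(acc, a4, b4):
    """One 'instruction': multiply four low-precision (8-bit) pairs
    and accumulate their sum into one high-precision element."""
    assert len(a4) == len(b4) == 4
    return acc + sum(a * b for a, b in zip(a4, b4))

def int8_dot(a, b):
    """Dot product of int8 vectors via the 4-wide idiom; lengths are
    assumed to be equal and a multiple of 4."""
    acc = 0                      # a 32-bit accumulator on real hardware
    for i in range(0, len(a), 4):
        acc = dot4_accumulate(acc, a[i:i + 4], b[i:i + 4])
    return acc
```

The compilation challenge UNIT addresses is recognizing when a tensor operation's loop nest matches this reduction shape and rewriting the inner loop into one such instruction per group of elements.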