{"title":"CGO 2021 Sponsors","authors":"","doi":"10.1109/cgo51591.2021.9370302","DOIUrl":"https://doi.org/10.1109/cgo51591.2021.9370302","url":null,"abstract":"","PeriodicalId":275062,"journal":{"name":"2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124364088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Object Versioning for Flow-Sensitive Pointer Analysis","authors":"M. Barbar, Yulei Sui, Shiping Chen","doi":"10.1109/CGO51591.2021.9370334","DOIUrl":"https://doi.org/10.1109/CGO51591.2021.9370334","url":null,"abstract":"Flow-sensitive points-to analysis provides better precision than its flow-insensitive counterpart. Traditionally performed on the control-flow graph, it incurs heavy analysis overhead. For performance, staged flow-sensitive analysis (SFS) is conducted on a pre-computed def-use (value-flow) graph where points-to sets of variables are propagated across def-use chains sparsely rather than across control-flow in the control-flow graph. SFS makes the propagation of different objects' points-to sets sparse (multiple-object sparsity), however, it suffers from redundant propagation between instructions of the same object's points-to sets (single-object sparsity). The points-to set of an object is often duplicated, resulting in redundant propagation and storage, especially in real-world heap-intensive programs. We notice that a simple graph prelabelling extension can identify much of this redundancy in a pre-analysis. With this pre-analysis, multiple nodes (instructions) in the value-flow graph can share an individual memory object's points-to set rather than each node maintaining its own points-to set for that single object. We present object versioning for flow-sensitive points-to analysis, a finer single-object sparsity technique which maintains the same precision while allowing us to avoid much of the redundancy present in propagating and storing points-to sets. Our experiments conducted on 15 open-source programs, when compared with SFS, show that our approach runs up to 26.22× faster (5.31× on average), and reduces memory usage by up to 5.46× (2.11 × on average).","PeriodicalId":275062,"journal":{"name":"2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115109059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scaling Up the IFDS Algorithm with Efficient Disk-Assisted Computing","authors":"Haofeng Li, Haining Meng, Hengjie Zheng, Liqing Cao, Jie Lu, Lian Li, Lin Gao","doi":"10.1109/CGO51591.2021.9370311","DOIUrl":"https://doi.org/10.1109/CGO51591.2021.9370311","url":null,"abstract":"The IFDS algorithm can be memory-intensive, requiring a memory budget of more than 100 GB of RAM for some applications. The large memory requirements significantly restrict the deployment of IFDS-based tools in practise. To improve this, we propose a disk-assisted solution that drastically reduces the memory requirements of traditional IFDS solvers. Our solution saves memory by 1) recomputing instead of memorizing intermediate analysis data, and 2) swapping in-memory data to disk when memory usages reach a threshold. We implement sophisticated scheduling schemes to swap data between memory and disks efficiently. We have developed a new taint analysis tool, DiskDroid, based on our disk-assisted IFDS solver. Compared to FlowDroid, a state-of-the-art IFDS-based taint analysis tool, for a set of 19 apps which take from 10 to 128 GB of RAM by FlowDroid, DiskDroid can analyze them with less than 10GB of RAM at a slight performance improvement of 8.6%. In addition, for 21 apps requiring more than 128GB of RAM by FlowDroid, DiskDroid can analyze each app in 3 hours, under the same memory budget of 10GB. This makes the tool deployable to normal desktop environments. We make the tool publicly available at https://github.com/HaofLi/DiskDroid.","PeriodicalId":275062,"journal":{"name":"2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124398250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Interval Compiler for Sound Floating-Point Computations","authors":"Joao Rivera, F. Franchetti, Markus Püschel","doi":"10.1109/CGO51591.2021.9370307","DOIUrl":"https://doi.org/10.1109/CGO51591.2021.9370307","url":null,"abstract":"Floating-point arithmetic is widely used by software developers but is unsound, i.e., there is no guarantee on the accuracy obtained, which can be imperative in safety-critical applications. We present IGen, a source-to-source compiler that translates a given C function using floating-point into an equivalent sound C function that uses interval arithmetic. IGen supports Intel SIMD intrinsics in the input function using a specially designed code generator and can produce SIMD-optimized output. To mitigate a possible loss of accuracy due to the increase of interval sizes, IGen can compile to double-double precision, again SIMD-optimized. Finally, IGen implements an accuracy optimization for the common reduction pattern. We benchmark our compiler on high-performance code in the domain of linear algebra and signal processing. The results show that the generated code delivers sound double precision results at high performance. In particular, we observe speed-ups of up to 9.8 when compared to commonly used interval libraries using double precision. When compiling to double-double, our compiler delivers intervals that keep error accumulation small enough to compute results with at most one bit of error in double precision, i.e., certified double precision results.","PeriodicalId":275062,"journal":{"name":"2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127495825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a Domain-Extensible Compiler: Optimizing an Image Processing Pipeline on Mobile CPUs","authors":"T. Koehler, Michel Steuwer","doi":"10.1109/CGO51591.2021.9370337","DOIUrl":"https://doi.org/10.1109/CGO51591.2021.9370337","url":null,"abstract":"Halide and many similar projects have demonstrated the great potential of domain specific optimizing compilers. They enable programs to be expressed at a convenient high-level, while generating high-performance code for parallel architectures. As domains of interest expand towards deep learning, probabilistic programming and beyond, it becomes increasingly clear that it is unsustainable to redesign domain specific compilers for each new domain. In addition, the rapid growth of hardware architectures to optimize for poses great challenges for designing these compilers. In this paper, we show how to extend a unifying domain-extensible compiler with domain-specific as well as hardware-specific optimizations. The compiler operates on generic patterns that have proven flexible enough to express a wide range of computations. Optimizations are not hard-coded into the compiler but are expressed as user-defined rewrite rules that are composed into strategies controlling the optimization process. Crucially, both computational patterns and optimization strategies are extensible without modifying the core compiler implementation. We demonstrate that this domain-extensible compiler design is capable of expressing image processing pipelines and well-known image processing optimizations. Our results on four mobile ARM multi-core CPUs, often used for image processing tasks, show that the code generated for the Harris operator outperforms the image processing library OpenCV by up to 16× and achieves performance close to - or even up to 1.4 × better than - the state-of-the-art image processing compiler Halide.","PeriodicalId":275062,"journal":{"name":"2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115119010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures","authors":"C. Alappat, Johannes Seiferth, G. Hager, Matthias Korch, T. Rauber, G. Wellein","doi":"10.1109/CGO51591.2021.9370316","DOIUrl":"https://doi.org/10.1109/CGO51591.2021.9370316","url":null,"abstract":"The landscape of multi-core architectures is growing more complex and diverse. Optimal application performance tuning parameters can vary widely across CPUs, and finding them in a possibly multidimensional parameter search space can be time consuming, expensive and potentially infeasible. In this work, we introduce YaskSite, a tool capable of tackling these challenges for stencil computations. YaskSite is built upon Intel's YASK framework. It combines YASK's flexibility to deal with different target architectures with the Execution-Cache-Memory performance model, which enables identifying optimal performance parameters analytically without the need to run the code. Further we show that YaskSite's features can be exploited by external tuning frameworks to reliably select the most efficient kernel(s) for the application at hand. To demonstrate this, we integrate YaskSite into Offsite, an offline tuner for explicit ordinary differential equation methods, and show that the generated performance predictions are reliable and accurate, leading to considerable performance gains at minimal code generation time and autotuning costs on the latest Intel Cascade Lake and AMD Rome CPUs.","PeriodicalId":275062,"journal":{"name":"2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115562785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Execution of Graph Algorithms on CPU with SIMD Extensions","authors":"Ruohuang Zheng, Sreepathi Pai","doi":"10.1109/CGO51591.2021.9370326","DOIUrl":"https://doi.org/10.1109/CGO51591.2021.9370326","url":null,"abstract":"Existing state-of-the-art CPU graph frameworks take advantage of multiple cores, but not the SIMD capability within each core. In this work, we retarget an existing GPU graph algorithm compiler to obtain the first graph framework that uses SIMD extensions on CPUs to efficiently execute graph algorithms. We evaluate this compiler on 10 benchmarks and 3 graphs on 3 different CPUs and also compare to the GPU. Evaluation results show that on a 8-core machine, enabling SIMD on a naive multi-core implementation achieves an additional 7.48x speedup, averaged across 10 benchmarks and 3 inputs. Applying our SIMD-targeted optimizations improves the plain SIMD implementation by 1.67x, outperforming a serial implementation by 12.46x. On average, the optimized multi-core SIMD version also outperforms the state-of-the-art graph framework, GraphIt, by 1.53x, averaged across 5 (common) benchmarks. SIMD execution on CPUs closes the gap between the CPU and GPU to 1.76x, but the CPU virtual memory performs better when graphs are much bigger than available physical memory.","PeriodicalId":275062,"journal":{"name":"2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126855295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Progressive Raising in Multi-level IR","authors":"Lorenzo Chelini, Andi Drebes, O. Zinenko, Albert Cohen, Nicolas Vasilache, Tobias Grosser, Henk Corporaal","doi":"10.1109/CGO51591.2021.9370332","DOIUrl":"https://doi.org/10.1109/CGO51591.2021.9370332","url":null,"abstract":"Multi-level intermediate representations (IR) show great promise for lowering the design costs for domain-specific compilers by providing a reusable, extensible, and non-opini-onated framework for expressing domain-specific and high-level abstractions directly in the IR. But, while such frameworks support the progressive lowering of high-level representations to low-level IR, they do not raise in the opposite direction. Thus, the entry point into the compilation pipeline defines the highest level of abstraction for all subsequent transformations, limiting the set of applicable optimizations, in particular for general-purpose languages that are not semantically rich enough to model the required abstractions. We propose Progressive Raising, a complementary approach to the progressive lowering in multi-level IRs that raises from lower to higher-level abstractions to leverage domain-specific transformations for low-level representations. We further introduce Multilevel Tactics, our declarative approach for progressive raising, implemented on top of the MLIR framework, and demonstrate the progressive raising from affine loop nests specified in a general-purpose language to high-level linear algebra operations. Our raising paths leverage subsequent high-level domain-specific transformations with significant performance improvements.","PeriodicalId":275062,"journal":{"name":"2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"505 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115887551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cinnamon: A Domain-Specific Language for Binary Profiling and Monitoring","authors":"Mahwish Arif, Ruoyu Zhou, Hsi-Ming Ho, Timothy M. Jones","doi":"10.1109/CGO51591.2021.9370313","DOIUrl":"https://doi.org/10.1109/CGO51591.2021.9370313","url":null,"abstract":"Binary instrumentation and rewriting frameworks provide a powerful way of implementing custom analysis and transformation techniques for applications ranging from performance profiling to security monitoring. However, using these frameworks to write even simple analyses and transformations is non-trivial. Developers often need to write framework-specific boilerplate code and work with low-level and complex programming details. This not only results in hundreds (or thousands) of lines of code, but also leaves significant room for error. To address this, we introduce Cinnamon, a domain-specific language designed to write programs for binary profiling and monitoring. Cinnamon's abstractions allow the programmer to focus on implementing their technique in a platform-independent way, without worrying about complex lower-level details. Programmers can use these abstractions to perform analysis and instrumentation at different locations and granularity levels in the binary. The flexibility of Cinnamon also enables its programs to be mapped to static, dynamic or hybrid analysis and instrumentation approaches. As a proof of concept, we target Cinnamon to three different binary frameworks by implementing a custom Cinnamon to C/C++ compiler and integrating the generated code within these frameworks. We further demonstrate the ability of Cinnamon to express a range of profiling and monitoring tools through different use-cases.","PeriodicalId":275062,"journal":{"name":"2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"258 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116418421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fine-Grained Pipeline Parallelization for Network Function Programs","authors":"Seungbin Song, Heelim Choi, Hanjun Kim","doi":"10.1109/CGO51591.2021.9370309","DOIUrl":"https://doi.org/10.1109/CGO51591.2021.9370309","url":null,"abstract":"Network programming languages enable programmers to implement new network functions on various hardware and software stacks in the domain of Software Defined Net-working (SDN). Although the languages extend the flexibility of network devices, existing compilers do not fully optimize the network programs due to their coarse-grained parallelization methods. The compilers consider each packet processing table that consists of match and action functions as a unit of tasks and parallelize the programs without decomposing match and action functions. This work proposes a new fine-grained pipeline parallelization compiler for network programming languages, named PSDN. First, the PSDN compiler decouples match and action functions from packet processing tables and analyzes dependencies among the matches and actions. While respecting the dependencies, the compiler efficiently schedules each match and action function into a pipeline with clock cycle estimation and fuses functions to reduce synchronization overheads. This work implements the PSDN compiler that translates a P4 network program to a Xilinx PX program, which is synthesizable to NetFPGA-SUME hardware. The proposed compiler reduces packet processing latency by 12.1 % and utilization by 3.5 % compared to previous work.","PeriodicalId":275062,"journal":{"name":"2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116474159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}