Proceedings of the 2018 International Symposium on Code Generation and Optimization: Latest Papers

nAdroid: statically detecting ordering violations in Android applications
Xinwei Fu, Dongyoon Lee, Changhee Jung
DOI: https://doi.org/10.1145/3168829
Published: 2018-02-24
Abstract: Modern mobile applications use a hybrid concurrency model. In this model, events are handled sequentially by event loop(s), and long-running tasks are offloaded to other threads. Concurrency errors in this hybrid concurrency model can take multiple forms: traditional atomicity and ordering violations between threads, as well as ordering violations between event callbacks on a single event loop. This paper presents nAdroid, a static ordering violation detector for Android applications. Using our threadification technique, nAdroid statically models event callbacks as threads. Threadification converts ordering violations between event callbacks into ordering violations between threads, after which state-of-the-art thread-based race detection tools can be applied. nAdroid then applies a combination of sound and unsound filters, based on the Android concurrency model and its happens-before relation, to prune out false and benign warnings. We evaluated nAdroid with 27 open source Android applications. Experimental results show that nAdroid detects 88 (at least 58 new) harmful ordering violations, and outperforms the state-of-the-art static technique with fewer false negatives and false positives.
Citations: 14
Analyzing and optimizing task granularity on the JVM
Andrea Rosà, Eduardo Rosales, Walter Binder
DOI: https://doi.org/10.1145/3168828
Published: 2018-02-24
Abstract: Task granularity, i.e., the amount of work performed by parallel tasks, is a key performance attribute of parallel applications. On the one hand, fine-grained tasks (i.e., small tasks carrying out few computations) may introduce considerable parallelization overheads. On the other hand, coarse-grained tasks (i.e., large tasks performing substantial computations) may not fully utilize the available CPU cores, resulting in missed parallelization opportunities. In this paper, we provide a better understanding of task granularity for applications running on a Java Virtual Machine. We present a novel profiler which measures the granularity of every executed task. Our profiler collects carefully selected metrics from the whole system stack with only little overhead, and helps the developer locate performance problems. We analyze task granularity in the DaCapo and ScalaBench benchmark suites, revealing several inefficiencies related to fine-grained and coarse-grained tasks. We demonstrate that the collected task-granularity profiles are actionable by optimizing task granularity in two benchmarks, achieving speedups up to 1.53x.
Citations: 18
DeLICM: scalar dependence removal at zero memory cost
Michael Kruse, T. Grosser
DOI: https://doi.org/10.1145/3168815
Published: 2018-02-24
Abstract: Increasing data movement costs motivate the integration of polyhedral loop optimizers in the standard flow (-O3) of production compilers. While polyhedral optimizers have been shown to be effective when applied as source-to-source transformation, the static single assignment (SSA) form used in modern compiler mid-ends makes such optimizers less effective. Scalar dependencies (dependencies carried over a single memory location) are the main obstacle preventing effective optimization. We present DeLICM, a set of transformations which, backed by a polyhedral value analysis, eliminate problematic scalar dependences by 1) relocating scalar memory references to unused array locations and by 2) forwarding computations that otherwise cause scalar dependences. Our experiments show that DeLICM effectively eliminates dependencies introduced by compiler-internal canonicalization passes, human programmers, optimizing code generators, or inlining, without the need for any additional memory allocation. As a result, polyhedral loop optimizations can be better integrated into compiler pass pipelines which is essential for metaprogramming optimization.
Citations: 8
Synthesizing an instruction selection rule library from semantic specifications
Sebastian Buchwald, Andreas Fried, Sebastian Hack
DOI: https://doi.org/10.1145/3168821
Published: 2018-02-24
Abstract: Instruction selection is the part of a compiler that transforms intermediate representation (IR) code into machine code. Instruction selectors build on a library of hundreds if not thousands of rules. Creating and maintaining these rules is a tedious and error-prone manual process. In this paper, we present a fully automatic approach to create provably correct rule libraries from formal specifications of the instruction set architecture and the compiler IR. We use a hybrid approach that combines enumerative techniques with template-based counterexample-guided inductive synthesis (CEGIS). Thereby, we overcome several shortcomings of existing approaches, which were not able to handle complex instructions in a reasonable amount of time. In particular, we efficiently model memory operations. Our tool synthesized a large part of the integer arithmetic rules for the x86 architecture within a few days where existing techniques could not deliver a substantial rule library within weeks. Using the rule library, we generate a prototype instruction selector that produces code on par with a manually-tuned instruction selector. Furthermore, using 63012 test cases generated from the rule library, we identified 29498 rules that both Clang and GCC miss.
Citations: 12
The generalized matrix chain algorithm
Henrik Barthels, Marcin Copik, P. Bientinesi
DOI: https://doi.org/10.1145/3168804
Published: 2018-02-24
Abstract: In this paper, we present a generalized version of the matrix chain algorithm to generate efficient code for linear algebra problems, a task for which human experts often invest days or even weeks of work. The standard matrix chain problem consists in finding the parenthesization of a matrix product M := A1 A2 ⋯ An that minimizes the number of scalar operations. In practical applications, however, one frequently encounters more complicated expressions, involving transposition, inversion, and matrix properties. Indeed, the computation of such expressions relies on a set of computational kernels that offer functionality well beyond the simple matrix product. The challenge then shifts from finding an optimal parenthesization to finding an optimal mapping of the input expression to the available kernels. Furthermore, it is often the case that a solution based on the minimization of scalar operations does not result in the optimal solution in terms of execution time. In our experiments, the generated code outperforms other libraries and languages on average by a factor of about 9. The motivation for this work comes from the fact that, despite great advances in the development of compilers, the task of mapping linear algebra problems to optimized kernels is still to be done manually. In order to relieve the user from this complex task, new techniques for the compilation of linear algebra expressions have to be developed.
Citations: 10
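For orientation, the standard matrix chain problem mentioned in the abstract above has a textbook dynamic-programming solution. The following is a minimal sketch of that textbook algorithm only, not of the generalized algorithm the paper presents. It assumes that multiplying a p×q matrix by a q×r matrix costs p·q·r scalar multiplications; the function name `matrix_chain_order` and the example dimensions are illustrative and do not come from the paper.

```python
from functools import lru_cache


def matrix_chain_order(dims):
    """Return (min_cost, parenthesization) for the product A1 A2 ... An.

    Matrix Ai has shape dims[i-1] x dims[i], so len(dims) == n + 1.
    Cost model (an assumption): a (p x q) times (q x r) product costs p*q*r.
    """
    n = len(dims) - 1
    if n < 1:
        raise ValueError("need at least one matrix")

    @lru_cache(maxsize=None)
    def best(i, j):
        # Cheapest way to compute the sub-chain A(i+1) ... A(j+1), 0-based i..j.
        if i == j:
            return 0, f"A{i + 1}"
        candidates = []
        for k in range(i, j):  # split between positions k and k + 1
            left_cost, left_expr = best(i, k)
            right_cost, right_expr = best(k + 1, j)
            cost = left_cost + right_cost + dims[i] * dims[k + 1] * dims[j + 1]
            candidates.append((cost, f"({left_expr} {right_expr})"))
        return min(candidates, key=lambda c: c[0])

    return best(0, n - 1)


if __name__ == "__main__":
    # A1 is 10x30, A2 is 30x5, A3 is 5x60 (illustrative sizes).
    print(matrix_chain_order((10, 30, 5, 60)))
```

The abstract's point is that this cost model stops being sufficient once transposition, inversion, and matrix properties enter the expression, because the choice is then among kernels rather than among parenthesizations.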
Conflict-free vectorization of associative irregular applications with recent SIMD architectural advances
Peng Jiang, G. Agrawal
DOI: https://doi.org/10.1145/3168827
Published: 2018-02-24
Abstract: Irregular applications that involve indirect memory accesses were traditionally considered unsuitable for SIMD processing. Though some progress has been made in recent years, the existing approaches require either expensive data reorganization or favorable input distribution to deliver good performance. In this work, we propose a novel vectorization approach called in-vector reduction that can efficiently accelerate a class of associative irregular applications. This approach exploits associativity in the irregular reductions to resolve the data conflicts within SIMD vectors. We implement in-vector reduction with the new conflict detecting instructions that are supported in Intel AVX-512 instruction set and provide a programming interface to facilitate the vectorization of such associative irregular applications. Compared with previous approaches, in-vector reduction eliminates a large part of the overhead of data reorganization and achieves high SIMD utilization even under adverse input distributions. The evaluation results show that our approach is efficient in vectorizing a diverse set of irregular applications, including graph algorithms, particle simulation codes, and hash-based aggregation. Our vectorization achieves 1.5x to 5.5x speedups over the original sequential codes on a single core of Intel Xeon Phi and outperforms a competing approach, conflict-masking based vectorization, by 1.4x to 11.8x.
Citations: 6
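The idea of resolving conflicts inside a vector by exploiting associativity can be illustrated with a small scalar simulation. This is only a conceptual sketch in plain Python: it does not use AVX-512 conflict-detection instructions, it is not the authors' implementation, and the function names, the lane width of 4, and the use of addition as the associative operator are all assumptions made for illustration.

```python
def in_vector_reduce(lane_indices, lane_values):
    """Merge lanes that target the same index (simulated in scalar code).

    Returns parallel lists of distinct indices and their combined values,
    so that a following scatter has no write conflicts.
    """
    merged = {}
    for idx, val in zip(lane_indices, lane_values):
        # Addition is associative, so lanes can be combined in any grouping.
        merged[idx] = merged.get(idx, 0) + val
    return list(merged.keys()), list(merged.values())


def irregular_sum(indices, values, out_size, width=4):
    """Compute out[indices[i]] += values[i], one simulated vector at a time."""
    out = [0] * out_size
    for start in range(0, len(indices), width):
        lane_idx = indices[start:start + width]
        lane_val = values[start:start + width]
        uniq_idx, uniq_val = in_vector_reduce(lane_idx, lane_val)
        # Every index is now distinct within the vector: conflict-free scatter.
        for i, v in zip(uniq_idx, uniq_val):
            out[i] += v
    return out


if __name__ == "__main__":
    print(irregular_sum([0, 1, 0, 2, 2, 2, 1, 0], [1, 2, 3, 4, 5, 6, 7, 8], 3))
```

The sketch always processes full vectors regardless of how many lanes collide, which is the property the abstract contrasts with conflict-masking approaches.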
AutoPA: automatically generating active driver from original passive driver code
Jia-Ju Bai, Yuping Wang, Shimin Hu
DOI: https://doi.org/10.1145/3168809
Published: 2018-02-24
Abstract: Original device drivers are often passive in common operating systems, and they should correctly handle synchronization when concurrently invoked by multiple external threads. However, many concurrency bugs have occurred in drivers due to incautious synchronization. To solve concurrency problems, active driver is proposed to replace original passive driver. An active driver has its own thread and does not need to handle synchronization, thus the occurrence probability of many concurrency bugs can be effectively reduced. But previous approaches of active driver have some limitations. The biggest limitation is that original passive driver code needs to be manually rewritten. In this paper, we propose a practical approach, AutoPA, to automatically generate efficient active driver from original passive driver code. AutoPA uses function analysis and code instrumentation to perform automated driver generation, and it uses an improved active driver architecture to reduce performance degradation. We have evaluated AutoPA on 20 Linux drivers. The results show that AutoPA can automatically and successfully generate usable active drivers from original driver code. And generated active drivers can work normally with or without the synchronization primitives in original driver code. To check the effect of AutoPA on driver reliability, we perform fault injection testing on the generated active drivers, and find that all injected concurrency faults are well tolerated and the drivers can work normally. And the performance of generated active drivers is not obviously degraded compared to original passive drivers.
Citations: 2
Register allocation for Intel processor graphics
Weiyu Chen, Guei-Yuan Lueh, Pratik Ashar, Kaiyu Chen, B. Cheng
DOI: https://doi.org/10.1145/3168806
Published: 2018-02-24
Abstract: Register allocation is a well-studied problem, but surprisingly little work has been published on assigning registers for GPU architectures. In this paper we present the register allocator in the production compiler for Intel HD and Iris Graphics. Intel GPUs feature a large byte-addressable register file organized into banks, an expressive instruction set that supports variable SIMD-sizes and divergent control flow, and high spill overhead due to relatively long memory latencies. These distinctive characteristics impose challenges for register allocation, as input programs may have arbitrarily-sized variables, partial updates, and complex control flow. Not only should the allocator make a program spill-free, but it must also reduce the number of register bank conflicts and anti-dependencies. Since compilation occurs in a JIT environment, the allocator also needs to incur little overhead. To manage compilation overhead, our register allocation framework adopts a hybrid approach that separates the assignment of local and global variables. Several extensions are introduced to the traditional graph-coloring algorithm to support variables with different sizes and to accurately model liveness under divergent branches. Different assignment policies are applied to exploit the trade-offs between minimizing register usage and avoiding bank conflicts and anti-dependencies. Experimental results show our framework produces very few spilling kernels and can improve RA JIT time by up to 4x over pure graph-coloring. Our round-robin and bank-conflict-reduction assignment policies can also achieve up to 20% runtime improvements.
Citations: 16
Qubit allocation
Marcos Yukio Siraichi, V. F. Santos, Caroline Collange, Fernando Magno Quintão Pereira
DOI: https://doi.org/10.1145/3168822
Published: 2018-02-24
Abstract: In May of 2016, IBM Research made a quantum processor available in the cloud to the general public. The possibility of programming an actual quantum device has elicited much enthusiasm. Yet, quantum programming still lacks the compiler support that modern programming languages enjoy today. To use universal quantum computers like IBM's, programmers must design low-level circuits. In particular, they must map logical qubits into physical qubits that need to obey connectivity constraints. This task resembles the early days of programming, in which software was built in machine languages. In this paper, we formally introduce the qubit allocation problem and provide an exact solution to it. This optimal algorithm deals with the simple quantum machinery available today; however, it cannot scale up to the more complex architectures scheduled to appear. Thus, we also provide a heuristic solution to qubit allocation, which is faster than the current solutions already implemented to deal with this problem.
Citations: 214
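To make the mapping of logical qubits to physical qubits under connectivity constraints concrete, here is a deliberately simplified brute-force sketch. It is not the exact or heuristic algorithm from the paper: it only searches for one fixed initial mapping and counts the two-qubit gates that the coupling graph does not directly support, ignoring the circuit transformations a real allocator would insert. The function name `find_initial_mapping`, the representation of the coupling graph as a set of directed (control, target) pairs, and the three-qubit example device are all hypothetical.

```python
from itertools import permutations


def find_initial_mapping(num_logical, cnots, coupling):
    """Exhaustively search for the mapping with the fewest unsupported CNOTs.

    cnots:    list of (control, target) pairs over logical qubits 0..num_logical-1
    coupling: set of directed (control, target) pairs over physical qubits
    Returns (mapping, violations); mapping is a dict logical -> physical,
    or (None, None) if there are fewer physical than logical qubits.
    """
    physical = sorted({q for edge in coupling for q in edge})
    best_mapping, best_violations = None, None
    # Try every injective assignment; this is factorial in the device size.
    for assignment in permutations(physical, num_logical):
        violations = sum(
            (assignment[c], assignment[t]) not in coupling for c, t in cnots
        )
        if best_violations is None or violations < best_violations:
            best_mapping = dict(enumerate(assignment))
            best_violations = violations
            if violations == 0:
                break  # cannot do better than zero unsupported gates
    return best_mapping, best_violations


if __name__ == "__main__":
    # Hypothetical device: physical qubits 0 -> 1 -> 2 in a directed line.
    coupling = {(0, 1), (1, 2)}
    cnots = [(0, 2), (2, 1)]
    print(find_initial_mapping(3, cnots, coupling))
```

The factorial search space in this toy version gives a feel for why the abstract notes that an exact approach cannot scale to larger architectures and why a heuristic is offered alongside it.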
Program generation for small-scale linear algebra applications
Daniele G. Spampinato, Diego Fabregat-Traver, P. Bientinesi, Markus Püschel
DOI: https://doi.org/10.1145/3168812
Published: 2018-02-24
Abstract: We present SLinGen, a program generation system for linear algebra. The input to SLinGen is an application expressed mathematically in a linear-algebra-inspired language (LA) that we define. LA provides basic scalar/vector/matrix additions/multiplications and higher level operations including linear systems solvers, Cholesky and LU factorizations. The output of SLinGen is performance-optimized single-source C code, optionally vectorized with intrinsics. The target of SLinGen are small-scale computations on fixed-size operands, for which a straightforward implementation using optimized libraries (e.g., BLAS or LAPACK) is known to yield suboptimal performance (besides increasing code size and introducing dependencies), but which are crucial in control, signal processing, computer vision, and other domains. Internally, SLinGen uses synthesis and DSL-based techniques to optimize at a high level of abstraction. We benchmark our program generator on three prototypical applications: the Kalman filter, Gaussian process regression, and an L1-analysis convex solver, as well as basic routines including Cholesky factorization and solvers for the continuous-time Lyapunov and Sylvester equations. The results show significant speed-ups compared to straightforward C with Intel icc and clang with a polyhedral optimizer, as well as library-based and template-based implementations.
Citations: 30