Mapping parallelism in a functional IR through constraint satisfaction: a case study on convolution for mobile GPUs

Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction. Pub Date: 2022-03-18. DOI: 10.1145/3497776.3517777

Naums Mogers, Lu Li, Valentin Radu, Christophe Dubach



Abstract

Graphics Processing Units (GPUs) are notoriously hard to optimize for manually. What is needed are good automatic code generators and optimizers. Accelerate, Futhark and Lift have demonstrated that a functional approach is well suited to this challenge. Lift, for instance, uses a system of rewrite rules in a multi-stage approach: algorithmic optimizations are explored first, followed by hardware-specific optimizations such as using shared memory and mapping parallelism. While the algorithmic exploration produces transformed programs that are correct by construction, the same is not necessarily true of the latter phase. Exploiting shared memory and mapping parallelism while ensuring correct synchronization is a delicate balancing act and is hard to encode in a rewrite system. Currently, Lift relies on heuristics with ad-hoc mechanisms to check for correctness. Although this practical approach eventually produces high-performance code, it is not an ideal state of affairs. This paper proposes to extract parallelization constraints automatically from a functional IR and to use a solver to identify valid rewritings. Using a convolutional neural network on a mobile GPU as a use case, this approach matches the performance of the ARM Compute Library GEMM convolution and the TVM-generated kernel while consuming between 2.7x and 3.6x less memory on average. Furthermore, a speedup of 12x is achieved over the ARM Compute Library direct convolution implementation.
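As a rough illustration of what "use a solver to identify valid rewritings" can mean, the toy sketch below enumerates ways to map a chain of nested `map` operations onto GPU parallelism levels and keeps only those that satisfy a few OpenCL-style nesting rules. It is not the paper's constraint system or Lift's API: the labels (`wrg0`, `lcl0`, `seq`), the three rules, and the function names are all assumptions made for this example, and brute-force enumeration stands in for a real constraint solver.

```python
from itertools import product

# Hypothetical labels for how one nested `map` may be executed:
# across work-groups (dimension 0 or 1), across work-items within a
# work-group (dimension 0 or 1), or sequentially.
CHOICES = ["wrg0", "wrg1", "lcl0", "lcl1", "seq"]


def is_valid(assignment):
    """Check a mapping for a chain of nested maps, listed outermost first.

    Illustrative constraints (assumed for this sketch):
      1. A parallel dimension may be used at most once along the chain.
      2. A work-group map may not appear inside a local map.
      3. A local map must appear inside some work-group map.
    """
    seen = set()
    for choice in assignment:
        if choice != "seq":
            if choice in seen:
                return False  # rule 1: dimension reused in the nest
            seen.add(choice)
        has_wrg = any(c.startswith("wrg") for c in seen)
        has_lcl = any(c.startswith("lcl") for c in seen)
        if choice.startswith("wrg") and has_lcl:
            return False  # rule 2: work-group map nested in a local map
        if choice.startswith("lcl") and not has_wrg:
            return False  # rule 3: local map without an enclosing work-group map
    return True


def valid_mappings(depth):
    """Enumerate every assignment for `depth` nested maps and keep the valid ones."""
    return [a for a in product(CHOICES, repeat=depth) if is_valid(a)]


if __name__ == "__main__":
    for mapping in valid_mappings(2):
        print(mapping)
```

In a sketch like this, the rules are written by hand; the point of the approach described in the abstract is that such constraints are extracted automatically from the functional IR, so that only mappings known to be valid are considered.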


