Towards Composable GPU Programming: Programming GPUs with Eager Actions and Lazy Views

Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores. Pub Date: 2017-02-04. DOI: 10.1145/3026937.3026942

Michael Haidl, Michel Steuwer, H. Dirks, Tim Humernbrum, S. Gorlatch


Citations: 1

Abstract

In this paper, we advocate a composable approach to programming systems with Graphics Processing Units (GPUs): programs are developed as compositions of generic, reusable patterns. Current GPU programming approaches either rely on low-level, monolithic code without patterns (CUDA and OpenCL), which achieves high performance at the cost of cumbersome and error-prone programming, or improve programmability through pattern-based abstractions (e.g., Thrust) but pay a performance penalty due to inefficient implementations of pattern composition. We develop a C++-based API for GPU programming with STL-style patterns, together with its compiler-based implementation. Our API gives application developers native C++ means (views and actions) to specify precisely which pattern compositions should be automatically fused during code generation into a single efficient GPU kernel, thereby ensuring high target performance. We implement our approach by extending the range-v3 library, which is currently being developed for the forthcoming C++ standards. Composable programming in our approach is done exclusively in standard C++14, with STL algorithms used as patterns, which we re-implemented in parallel for GPUs. Our compiler implementation is based on the LLVM and Clang frameworks, and we use advanced multi-stage programming techniques for aggressive runtime optimizations. We experimentally evaluate our approach using a set of benchmark applications and a real-world case study from the area of image processing. Our codes achieve performance competitive with monolithic CUDA implementations, and we outperform pattern-based codes written using Nvidia's Thrust.
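The distinction between lazy views and eager actions described in the abstract can be illustrated with a minimal CPU-only sketch in standard C++14. This is not the paper's API, nor range-v3's: the names `VectorView`, `MapView`, `view_of`, `lazy_map`, `eager_sum`, and `fused_example` are hypothetical and exist only for illustration, and the loop runs sequentially on the host rather than as a GPU kernel. The sketch shows only the general idea that composing views performs no work, while a single eager action evaluates the whole composition in one pass without intermediate buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view over a vector: holds a pointer, copies no data.
template <typename T>
struct VectorView {
    const std::vector<T>* data;
    std::size_t size() const { return data->size(); }
    const T& operator[](std::size_t i) const {
        assert(i < size());  // bounds check for the illustration
        return (*data)[i];
    }
};

template <typename T>
VectorView<T> view_of(const std::vector<T>& v) { return {&v}; }

// Hypothetical lazy view: stores a source view and a function.
// Nothing is computed until an element is requested.
template <typename Src, typename F>
struct MapView {
    Src src;  // views are cheap to copy, so they are stored by value
    F f;
    std::size_t size() const { return src.size(); }
    auto operator[](std::size_t i) const { return f(src[i]); }
};

template <typename Src, typename F>
MapView<Src, F> lazy_map(Src src, F f) { return {src, f}; }

// Hypothetical eager action: forces evaluation. Each element passes through
// the whole chain of composed functions inside this one loop, which is the
// CPU analogue of fusing the composition into a single kernel.
template <typename View, typename T>
T eager_sum(const View& v, T init) {
    for (std::size_t i = 0; i < v.size(); ++i) init += v[i];
    return init;
}

int fused_example() {
    std::vector<int> xs{1, 2, 3, 4};
    // Two composed views: no loop runs and no temporary vector is allocated.
    auto squared = lazy_map(view_of(xs), [](int x) { return x * x; });
    auto shifted = lazy_map(squared, [](int x) { return x + 1; });
    // One pass over xs computes (1+1) + (4+1) + (9+1) + (16+1).
    return eager_sum(shifted, 0);
}
```

Calling `fused_example()` returns 34. In this sketch, choosing where to place the eager step is what determines which compositions end up evaluated together in one loop.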

