Data-reuse optimizations for pipelined tiling with parametric tile sizes

Alexandre Isoard
{"title":"Data-reuse optimizations for pipelined tiling with parametric tile sizes","authors":"Alexandre Isoard","doi":"10.1145/2628071.2671425","DOIUrl":null,"url":null,"abstract":"Todays' hardware diversity exacerbates the need for optimizing compilers. A problem that arises when exploiting hardware accelerators (FPGA, GPU, dedicated boards) is how to automatically perform kernel/function offloading or outlining (as opposed to function inlining). The principle is to outsource part of the computation (the kernel to be performed on the accelerator) to a more efficient but more specialized hardware. This requires static analysis to identify the kernel input (data read) and output (data produced) and code generation for the kernel itself, the associated transfers, and the synchronization with the rest of the code (on the host CPU). In general, such tasks are done by the developer who is required to explicit the communications, allocate and size the intermediate buffers, and segment the kernel into fitting chunks of computation. When a single kernel is offloaded in a three-phases process (i.e., upload, compute, store back), such programming remains feasible: for GPUs, the developers can use OpenCL or CUDA, or rely on higherlevel abstractions, such as the directives of OpenACC1 or the garbage collector mechanisms of SPOC2. However, in some cases, it is necessary to decompose a kernel into a sequence of smaller kernels (to get blocking algorithms, thanks to loop tiling) that are optimized with pipelined communications and data reuse among blocks (tiles). The choice of tile sizes is driven by hardware capabilities such as memory bandwidth, memory size and organization, computational power, and such codes are extremely hard to obtain without automation and some cost model. The contribution supported by this abstract and the associated poster is a parametric (w.r.t. tile size) analysis technique to perform these steps, including inter-tile data reuse and pipelining, using polyhedral optimizations3. It has been presented at the IMPACT'14 workshop [2].","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"70 1-2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2628071.2671425","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Today's hardware diversity exacerbates the need for optimizing compilers. A problem that arises when exploiting hardware accelerators (FPGAs, GPUs, dedicated boards) is how to automatically perform kernel/function offloading or outlining (as opposed to function inlining). The principle is to outsource part of the computation (the kernel to be performed on the accelerator) to more efficient but more specialized hardware. This requires static analysis to identify the kernel input (data read) and output (data produced), and code generation for the kernel itself, the associated transfers, and the synchronization with the rest of the code (on the host CPU). In general, such tasks are done by the developer, who must make the communications explicit, allocate and size the intermediate buffers, and segment the kernel into chunks of computation that fit the device. When a single kernel is offloaded in a three-phase process (i.e., upload, compute, store back), such programming remains feasible: for GPUs, developers can use OpenCL or CUDA, or rely on higher-level abstractions, such as the directives of OpenACC or the garbage-collector mechanisms of SPOC. However, in some cases, it is necessary to decompose a kernel into a sequence of smaller kernels (to obtain blocking algorithms, thanks to loop tiling) that are optimized with pipelined communications and data reuse among blocks (tiles). The choice of tile sizes is driven by hardware capabilities such as memory bandwidth, memory size and organization, and computational power, and such code is extremely hard to obtain without automation and a cost model. The contribution supported by this abstract and the associated poster is a parametric (w.r.t. tile size) analysis technique to perform these steps, including inter-tile data reuse and pipelining, using polyhedral optimizations. It was presented at the IMPACT'14 workshop [2].
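To make the pipelined-tiling pattern concrete, the following hand-written CUDA sketch illustrates what the paper proposes to automate: a 1D 3-point stencil is split into tiles, each tile goes through the upload/compute/store-back phases, two streams double-buffer consecutive tiles so that the transfers of one tile overlap the computation of another, and inter-tile data reuse is exploited by uploading only the slice of each tile's input that the previous tile has not already placed on the device. This is a minimal sketch, not the paper's generated code: the names (TILE, HALO, stencil_kernel) and the fixed tile size are illustrative assumptions, whereas the paper keeps the tile size as a symbolic parameter of the polyhedral analysis.

// Minimal illustrative sketch (not the paper's generated code):
// pipelined tiled offload of a 1D 3-point stencil with inter-tile reuse.
#include <cuda_runtime.h>
#include <cstdio>

#define N    (1 << 20)  // problem size (assumed for the sketch)
#define TILE 4096       // tile size; parametric in the paper, fixed here
#define HALO 1          // one-element halo per side of a 3-point stencil

// Compute out[i] for i in [lo, hi), reading in[i-1 .. i+1].
__global__ void stencil_kernel(const float *in, float *out,
                               int lo, int hi, int n) {
    int i = lo + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < hi && i >= 1 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

int main() {
    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  N * sizeof(float));  // pinned, for async copies
    cudaMallocHost(&h_out, N * sizeof(float));
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    for (int i = 0; i < N; i++) { h_in[i] = (float)i; h_out[i] = 0.0f; }

    // Two streams double-buffer the tiles: while one stream computes
    // tile t, the other uploads tile t+1 and stores back tile t-1.
    cudaStream_t s[2];
    cudaEvent_t  up_done[2];
    for (int k = 0; k < 2; k++) {
        cudaStreamCreate(&s[k]);
        cudaEventCreate(&up_done[k]);
    }

    int ntiles = (N + TILE - 1) / TILE;
    for (int t = 0; t < ntiles; t++) {
        cudaStream_t st = s[t % 2];
        int lo = t * TILE;
        int hi = (lo + TILE < N) ? lo + TILE : N;

        // Inter-tile reuse: tile t-1 already uploaded in[0 .. lo+HALO),
        // so only the new slice [up_lo, up_hi) is transferred.
        int up_lo = (t == 0) ? 0 : lo + HALO;
        int up_hi = (hi + HALO < N) ? hi + HALO : N;
        cudaMemcpyAsync(d_in + up_lo, h_in + up_lo,
                        (up_hi - up_lo) * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        cudaEventRecord(up_done[t % 2], st);

        // The kernel also reads the halo uploaded by the *other* stream
        // (with tile t-1), so make this stream wait for that upload.
        if (t > 0) cudaStreamWaitEvent(st, up_done[(t - 1) % 2], 0);
        int threads = 256;
        int blocks  = (hi - lo + threads - 1) / threads;
        stencil_kernel<<<blocks, threads, 0, st>>>(d_in, d_out, lo, hi, N);

        // Store back this tile's output (boundary cells left untouched).
        cudaMemcpyAsync(h_out + lo, d_out + lo, (hi - lo) * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    printf("h_out[12345] = %f\n", h_out[12345]);

    for (int k = 0; k < 2; k++) {
        cudaStreamDestroy(s[k]);
        cudaEventDestroy(up_done[k]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

In the paper's setting, the upload, compute, and store-back slices of each tile (here the intervals [up_lo, up_hi) and [lo, hi)) are derived symbolically by the polyhedral analysis, with the tile size left as a parameter, rather than hand-computed as above.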