Data-reuse optimizations for pipelined tiling with parametric tile sizes

Alexandre Isoard
{"title":"Data-reuse optimizations for pipelined tiling with parametric tile sizes","authors":"Alexandre Isoard","doi":"10.1145/2628071.2671425","DOIUrl":null,"url":null,"abstract":"Todays' hardware diversity exacerbates the need for optimizing compilers. A problem that arises when exploiting hardware accelerators (FPGA, GPU, dedicated boards) is how to automatically perform kernel/function offloading or outlining (as opposed to function inlining). The principle is to outsource part of the computation (the kernel to be performed on the accelerator) to a more efficient but more specialized hardware. This requires static analysis to identify the kernel input (data read) and output (data produced) and code generation for the kernel itself, the associated transfers, and the synchronization with the rest of the code (on the host CPU). In general, such tasks are done by the developer who is required to explicit the communications, allocate and size the intermediate buffers, and segment the kernel into fitting chunks of computation. When a single kernel is offloaded in a three-phases process (i.e., upload, compute, store back), such programming remains feasible: for GPUs, the developers can use OpenCL or CUDA, or rely on higherlevel abstractions, such as the directives of OpenACC1 or the garbage collector mechanisms of SPOC2. However, in some cases, it is necessary to decompose a kernel into a sequence of smaller kernels (to get blocking algorithms, thanks to loop tiling) that are optimized with pipelined communications and data reuse among blocks (tiles). The choice of tile sizes is driven by hardware capabilities such as memory bandwidth, memory size and organization, computational power, and such codes are extremely hard to obtain without automation and some cost model. The contribution supported by this abstract and the associated poster is a parametric (w.r.t. tile size) analysis technique to perform these steps, including inter-tile data reuse and pipelining, using polyhedral optimizations3. It has been presented at the IMPACT'14 workshop [2].","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"70 1-2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2628071.2671425","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Today's hardware diversity exacerbates the need for optimizing compilers. A problem that arises when exploiting hardware accelerators (FPGAs, GPUs, dedicated boards) is how to automatically perform kernel/function offloading or outlining (as opposed to function inlining). The principle is to outsource part of the computation (the kernel to be performed on the accelerator) to more efficient but more specialized hardware. This requires static analysis to identify the kernel input (data read) and output (data produced), and code generation for the kernel itself, the associated transfers, and the synchronization with the rest of the code (on the host CPU). In general, such tasks are done by the developer, who must make the communications explicit, allocate and size the intermediate buffers, and segment the kernel into chunks of computation that fit the device. When a single kernel is offloaded in a three-phase process (i.e., upload, compute, store back), such programming remains feasible: for GPUs, developers can use OpenCL or CUDA, or rely on higher-level abstractions, such as the directives of OpenACC or the garbage-collector mechanisms of SPOC. However, in some cases, it is necessary to decompose a kernel into a sequence of smaller kernels (to obtain blocking algorithms, thanks to loop tiling) that are optimized with pipelined communications and data reuse among blocks (tiles). The choice of tile sizes is driven by hardware capabilities such as memory bandwidth, memory size and organization, and computational power, and such code is extremely hard to obtain without automation and a cost model. The contribution supported by this abstract and the associated poster is a parametric (w.r.t. tile size) analysis technique to perform these steps, including inter-tile data reuse and pipelining, using polyhedral optimizations. It was presented at the IMPACT'14 workshop [2].
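To make the pipelined-tiling pattern concrete, the following hand-written CUDA sketch illustrates what the paper proposes to automate: a 1D 3-point stencil is split into tiles, each tile goes through the upload/compute/store-back phases, two streams double-buffer consecutive tiles so that the transfers of one tile overlap the computation of another, and inter-tile data reuse is exploited by uploading only the slice of each tile's input that the previous tile has not already placed on the device. This is a minimal sketch, not the paper's generated code: the names (TILE, HALO, stencil_kernel) and the fixed tile size are illustrative assumptions, whereas the paper keeps the tile size as a symbolic parameter of the polyhedral analysis.

// Minimal illustrative sketch (not the paper's generated code):
// pipelined tiled offload of a 1D 3-point stencil with inter-tile reuse.
#include <cuda_runtime.h>
#include <cstdio>

#define N    (1 << 20)  // problem size (assumed for the sketch)
#define TILE 4096       // tile size; parametric in the paper, fixed here
#define HALO 1          // one-element halo per side of a 3-point stencil

// Compute out[i] for i in [lo, hi), reading in[i-1 .. i+1].
__global__ void stencil_kernel(const float *in, float *out,
                               int lo, int hi, int n) {
    int i = lo + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < hi && i >= 1 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

int main() {
    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  N * sizeof(float));  // pinned, for async copies
    cudaMallocHost(&h_out, N * sizeof(float));
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    for (int i = 0; i < N; i++) { h_in[i] = (float)i; h_out[i] = 0.0f; }

    // Two streams double-buffer the tiles: while one stream computes
    // tile t, the other uploads tile t+1 and stores back tile t-1.
    cudaStream_t s[2];
    cudaEvent_t  up_done[2];
    for (int k = 0; k < 2; k++) {
        cudaStreamCreate(&s[k]);
        cudaEventCreate(&up_done[k]);
    }

    int ntiles = (N + TILE - 1) / TILE;
    for (int t = 0; t < ntiles; t++) {
        cudaStream_t st = s[t % 2];
        int lo = t * TILE;
        int hi = (lo + TILE < N) ? lo + TILE : N;

        // Inter-tile reuse: tile t-1 already uploaded in[0 .. lo+HALO),
        // so only the new slice [up_lo, up_hi) is transferred.
        int up_lo = (t == 0) ? 0 : lo + HALO;
        int up_hi = (hi + HALO < N) ? hi + HALO : N;
        cudaMemcpyAsync(d_in + up_lo, h_in + up_lo,
                        (up_hi - up_lo) * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        cudaEventRecord(up_done[t % 2], st);

        // The kernel also reads the halo uploaded by the *other* stream
        // (with tile t-1), so make this stream wait for that upload.
        if (t > 0) cudaStreamWaitEvent(st, up_done[(t - 1) % 2], 0);
        int threads = 256;
        int blocks  = (hi - lo + threads - 1) / threads;
        stencil_kernel<<<blocks, threads, 0, st>>>(d_in, d_out, lo, hi, N);

        // Store back this tile's output (boundary cells left untouched).
        cudaMemcpyAsync(h_out + lo, d_out + lo, (hi - lo) * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    printf("h_out[12345] = %f\n", h_out[12345]);

    for (int k = 0; k < 2; k++) {
        cudaStreamDestroy(s[k]);
        cudaEventDestroy(up_done[k]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

In the paper's setting, the upload, compute, and store-back slices of each tile (here the intervals [up_lo, up_hi) and [lo, hi)) are derived symbolically by the polyhedral analysis, with the tile size left as a parameter, rather than hand-computed as above.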