Thoroughly Exploring GPU Buffering Options for Stencil Code by Using an Efficiency Measure and a Performance Model

Yue Hu;David M. Koppelman;Steven Robert Brandt
DOI: 10.1109/TMSCS.2017.2705139
Journal: IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 3, pp. 477-490
Publication date: 2017-03-17 (Journal Article)
URL: https://ieeexplore.ieee.org/document/7930466/
Citations: 1

Abstract

Stencil computations form the basis for computer simulations across almost every field of science, such as computational fluid dynamics, data mining, and image processing. Their mostly regular data access patterns potentially enable them to take advantage of the high computation and data bandwidth of GPUs, but only if data buffering and other issues are handled properly. Finding a good code generation strategy presents a number of challenges, one of which is determining the best way to make use of memory. GPUs have several types of on-chip storage, including registers, shared memory, and a read-only cache. Choosing the type of storage and how it is used (a buffering strategy) for each stencil array (grid function, GF) requires a good understanding not only of its stencil pattern, but also of how efficiently each type of storage serves that GF, to avoid squandering storage that would be more beneficial to another GF. For a stencil computation with $N$ GFs, the total number of possible assignments is $\beta^{N}$, where $\beta$ is the number of buffering strategies. Our code-generation framework supports five buffering strategies ($\beta = 5$). Large, complex stencil kernels may consist of dozens of GFs, resulting in significant search overhead. In this work, we present an analytic performance model for stencil computations on GPUs and study the behavior of the read-only cache and the L2 cache. Next, we propose an efficiency-based assignment algorithm that scores a change in buffering strategy for a GF using a combination of (a) the predicted execution time and (b) on-chip storage usage. Using this scoring, an assignment for $N$ GFs can be determined in $(\beta - 1)N(N+1)/2$ steps. Results show that the performance model has good accuracy and that the assignment strategy is highly efficient.
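The step count $(\beta - 1)N(N+1)/2$ matches a greedy loop that, in each of $N$ rounds, scores the $\beta - 1$ alternative strategies for every not-yet-fixed GF, commits the single best-scoring change, and fixes that GF (rounds then touch $N, N-1, \ldots, 1$ GFs). A minimal Python sketch of such a loop follows; the function names and the toy score function are illustrative assumptions, not the authors' implementation, which scores changes with the paper's performance model and storage-usage measure:

```python
def assign_strategies(n_gfs, beta, score):
    """Greedy efficiency-based strategy assignment (illustrative sketch).

    Each round scores the (beta - 1) alternative strategies for every GF
    that has not yet been fixed, commits the single highest-scoring change
    (if it helps), and fixes that GF.  Rounds consider n, n-1, ..., 1 GFs,
    so the total number of scoring steps is (beta - 1) * n * (n + 1) / 2.
    """
    assignment = [0] * n_gfs          # strategy 0 = default (e.g., plain global loads)
    unfixed = list(range(n_gfs))      # GFs whose strategy may still change
    evaluations = 0
    while unfixed:
        best = None                   # (score, gf, strategy)
        for gf in unfixed:
            for strat in range(1, beta):          # beta - 1 alternatives per GF
                s = score(assignment, gf, strat)  # e.g., predicted-time gain
                evaluations += 1                  #      per unit of storage used
                if best is None or s > best[0]:
                    best = (s, gf, strat)
        best_score, gf, strat = best
        if best_score > 0:            # switch only if the change scores positively
            assignment[gf] = strat
        unfixed.remove(gf)            # this GF's strategy is now fixed
    return assignment, evaluations
```

With a toy score function, `assign_strategies(4, 5, ...)` performs exactly $(5-1) \cdot 4 \cdot 5 / 2 = 40$ scoring steps, in contrast to the $5^{4} = 625$ full assignments an exhaustive search would evaluate.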