自动利用隐含的管道并行从多个依赖内核的gpu

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI:10.1145/2967938.2967952

Gwangsun Kim, Jiyun Jeong, John Kim, M. Stephenson

{"title":"自动利用隐含的管道并行从多个依赖内核的gpu","authors":"Gwangsun Kim, Jiyun Jeong, John Kim, M. Stephenson","doi":"10.1145/2967938.2967952","DOIUrl":null,"url":null,"abstract":"Execution of GPGPU workloads consists of different stages including data I/O on the CPU, memory copy between the CPU and GPU, and kernel execution. While GPU can remain idle during I/O and memory copy, prior work has shown that overlapping data movement (I/O and memory copies) with kernel execution can improve performance. However, when there are multiple dependent kernels, the execution of the kernels is serialized and the benefit of overlapping data movement can be limited. In order to improve the performance of workloads that have multiple dependent kernels, we propose to automatically overlap the execution of kernels by exploiting implicit pipeline parallelism. We first propose Coarse-grained Reference Counting-based Scoreboarding (CRCS) to guarantee correctness during overlapped execution of multiple kernels. However, CRCS alone does not necessarily improve overall performance if the thread blocks (or CTAs) are scheduled sequentially. Thus, we propose an alternative CTA scheduler - Pipeline Parallelism-aware CTA Scheduler (PPCS) that takes available pipeline parallelism into account in CTA scheduling to maximize pipeline parallelism and improve overall performance. Our evaluation results show that the proposed mechanisms can improve performance by up to 67% (33% on average). To the best of our knowledge, this is one of the first work that enables overlapped execution of multiple dependent kernels without any kernel modification or explicitly expressing dependency by the programmer.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"18","resultStr":"{\"title\":\"Automatically exploiting implicit Pipeline Parallelism from multiple dependent kernels for GPUs\",\"authors\":\"Gwangsun Kim, Jiyun Jeong, John Kim, M. Stephenson\",\"doi\":\"10.1145/2967938.2967952\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Execution of GPGPU workloads consists of different stages including data I/O on the CPU, memory copy between the CPU and GPU, and kernel execution. While GPU can remain idle during I/O and memory copy, prior work has shown that overlapping data movement (I/O and memory copies) with kernel execution can improve performance. However, when there are multiple dependent kernels, the execution of the kernels is serialized and the benefit of overlapping data movement can be limited. In order to improve the performance of workloads that have multiple dependent kernels, we propose to automatically overlap the execution of kernels by exploiting implicit pipeline parallelism. We first propose Coarse-grained Reference Counting-based Scoreboarding (CRCS) to guarantee correctness during overlapped execution of multiple kernels. However, CRCS alone does not necessarily improve overall performance if the thread blocks (or CTAs) are scheduled sequentially. Thus, we propose an alternative CTA scheduler - Pipeline Parallelism-aware CTA Scheduler (PPCS) that takes available pipeline parallelism into account in CTA scheduling to maximize pipeline parallelism and improve overall performance. Our evaluation results show that the proposed mechanisms can improve performance by up to 67% (33% on average). To the best of our knowledge, this is one of the first work that enables overlapped execution of multiple dependent kernels without any kernel modification or explicitly expressing dependency by the programmer.\",\"PeriodicalId\":407717,\"journal\":{\"name\":\"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)\",\"volume\":\"24 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"18\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2967938.2967952\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2967938.2967952","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 18

摘要

GPGPU工作负载的执行包括CPU上的数据I/O、CPU和GPU之间的内存拷贝、内核执行等不同阶段。虽然GPU在I/O和内存复制期间可以保持空闲，但先前的工作表明，在内核执行时重叠数据移动(I/O和内存复制)可以提高性能。然而，当存在多个依赖的内核时，内核的执行是序列化的，重叠数据移动的好处可能会受到限制。为了提高具有多个依赖内核的工作负载的性能，我们提出通过利用隐式管道并行性来自动重叠内核的执行。我们首先提出了基于粗粒度引用计数的计分板(CRCS)来保证多个内核重叠执行时的正确性。但是，如果顺序调度线程块(或cta)，单独使用CRCS并不一定会提高整体性能。因此，我们提出了一种替代的CTA调度器——管道并行感知CTA调度器(PPCS)，它在CTA调度中考虑了可用的管道并行性，以最大化管道并行性并提高整体性能。我们的评估结果表明，所提出的机制可以将性能提高67%(平均33%)。据我们所知，这是第一个允许多个依赖内核重叠执行的工作，而不需要对内核进行任何修改，也不需要程序员显式地表达依赖关系。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Automatically exploiting implicit Pipeline Parallelism from multiple dependent kernels for GPUs

Execution of GPGPU workloads consists of different stages including data I/O on the CPU, memory copy between the CPU and GPU, and kernel execution. While GPU can remain idle during I/O and memory copy, prior work has shown that overlapping data movement (I/O and memory copies) with kernel execution can improve performance. However, when there are multiple dependent kernels, the execution of the kernels is serialized and the benefit of overlapping data movement can be limited. In order to improve the performance of workloads that have multiple dependent kernels, we propose to automatically overlap the execution of kernels by exploiting implicit pipeline parallelism. We first propose Coarse-grained Reference Counting-based Scoreboarding (CRCS) to guarantee correctness during overlapped execution of multiple kernels. However, CRCS alone does not necessarily improve overall performance if the thread blocks (or CTAs) are scheduled sequentially. Thus, we propose an alternative CTA scheduler - Pipeline Parallelism-aware CTA Scheduler (PPCS) that takes available pipeline parallelism into account in CTA scheduling to maximize pipeline parallelism and improve overall performance. Our evaluation results show that the proposed mechanisms can improve performance by up to 67% (33% on average). To the best of our knowledge, this is one of the first work that enables overlapped execution of multiple dependent kernels without any kernel modification or explicitly expressing dependency by the programmer.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)

自引率

0.00%

发文量