{"title":"通用GPU系统的线程块调度算法分析","authors":"Soyeon Park, Kyungwoon Cho, H. Bahn","doi":"10.1109/CSDE53843.2021.9718419","DOIUrl":null,"url":null,"abstract":"Modern GPGPUs (General-Purpose Graphics Processing Units) have the ability of executing thousands of threads simultaneously. However, the resource utilization of GPGPU in real systems is limited as the load balancing between SMs (Stream Multiprocessors) is difficult during the scheduling of thread blocks, which are the basic units for resource allocation in GPGPU. In order to schedule thread blocks in GPGPU, the current hardware scheduler allocates thread blocks to SMs by the Round-Robin order. Although this is simple and easy to implement, we show that Round-Robin is not efficient when thread blocks of heterogeneous workloads are mixed. In such environments, efficient resource sharing in GPGPU is challenging as workloads have different resource usage patterns, but scheduling should be performed instantly. In this paper, we present a new thread block scheduling algorithm that has the ability of analyzing the load of SMs and the characteristics of pending thread blocks. Specifically, we formulate thread block scheduling as a bin-packing problem, and aim to minimize the internal fragmentation of SMs by arranging size-aware filling of thread blocks to overall SMs in advance. To do so, we make use of multiple queues for incoming thread blocks according to their sizes and perform scheduling by considering the load balancing of SMs. Our experimental results under a wide range of workload conditions show that the proposed algorithm improves the performance of GPGPU by 24.8% on average compared to the Round-Robin scheduler.","PeriodicalId":166950,"journal":{"name":"2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Analysis of Thread Block Scheduling Algorithms for General Purpose GPU Systems\",\"authors\":\"Soyeon Park, Kyungwoon Cho, H. Bahn\",\"doi\":\"10.1109/CSDE53843.2021.9718419\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Modern GPGPUs (General-Purpose Graphics Processing Units) have the ability of executing thousands of threads simultaneously. However, the resource utilization of GPGPU in real systems is limited as the load balancing between SMs (Stream Multiprocessors) is difficult during the scheduling of thread blocks, which are the basic units for resource allocation in GPGPU. In order to schedule thread blocks in GPGPU, the current hardware scheduler allocates thread blocks to SMs by the Round-Robin order. Although this is simple and easy to implement, we show that Round-Robin is not efficient when thread blocks of heterogeneous workloads are mixed. In such environments, efficient resource sharing in GPGPU is challenging as workloads have different resource usage patterns, but scheduling should be performed instantly. In this paper, we present a new thread block scheduling algorithm that has the ability of analyzing the load of SMs and the characteristics of pending thread blocks. Specifically, we formulate thread block scheduling as a bin-packing problem, and aim to minimize the internal fragmentation of SMs by arranging size-aware filling of thread blocks to overall SMs in advance. 
To do so, we make use of multiple queues for incoming thread blocks according to their sizes and perform scheduling by considering the load balancing of SMs. Our experimental results under a wide range of workload conditions show that the proposed algorithm improves the performance of GPGPU by 24.8% on average compared to the Round-Robin scheduler.\",\"PeriodicalId\":166950,\"journal\":{\"name\":\"2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE)\",\"volume\":\"36 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CSDE53843.2021.9718419\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CSDE53843.2021.9718419","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Analysis of Thread Block Scheduling Algorithms for General Purpose GPU Systems
Modern GPGPUs (General-Purpose Graphics Processing Units) can execute thousands of threads simultaneously. In real systems, however, GPGPU resource utilization is limited because load balancing across SMs (Streaming Multiprocessors) is difficult when scheduling thread blocks, the basic units of resource allocation in a GPGPU. The current hardware scheduler allocates thread blocks to SMs in Round-Robin order. Although this is simple and easy to implement, we show that Round-Robin is inefficient when thread blocks from heterogeneous workloads are mixed. In such environments, efficient resource sharing in a GPGPU is challenging because workloads have different resource usage patterns, yet scheduling decisions must be made instantly. In this paper, we present a new thread block scheduling algorithm that analyzes the load of each SM and the characteristics of pending thread blocks. Specifically, we formulate thread block scheduling as a bin-packing problem and aim to minimize the internal fragmentation of SMs by placing thread blocks onto SMs in a size-aware manner. To do so, we maintain multiple queues that classify incoming thread blocks by size and perform scheduling with the load balance of SMs in mind. Our experimental results under a wide range of workload conditions show that the proposed algorithm improves GPGPU performance by 24.8% on average compared to the Round-Robin scheduler.
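To make the contrast concrete, the following is a minimal sketch, not the paper's implementation, of how a strict Round-Robin dispatcher can strand SM capacity that a size-aware, bin-packing-style placement recovers within one scheduling wave. The capacity and block sizes (SM_CAPACITY, BLOCK_SIZES) and the best-fit heuristic used here are illustrative assumptions, not values or code from the paper.

```python
# Sketch: one dispatch wave under a strict Round-Robin policy vs. a
# size-aware (best-fit, bin-packing-style) policy. All numbers are
# illustrative assumptions, not measurements from the paper.

SM_CAPACITY = 2048                      # assumed per-SM resource budget
NUM_SMS = 2
BLOCK_SIZES = [1536, 512, 1536, 512]    # heterogeneous thread block sizes

def round_robin(blocks, num_sms, capacity):
    """Assign each block to the next SM in cyclic order; if that SM lacks
    room, the block stays pending for a later wave."""
    load = [0] * num_sms
    pending = []
    sm = 0
    for b in blocks:
        if load[sm] + b <= capacity:
            load[sm] += b
        else:
            pending.append(b)           # SM space elsewhere goes unused
        sm = (sm + 1) % num_sms
    return load, pending

def size_aware(blocks, num_sms, capacity):
    """Best-fit placement: largest blocks first, each onto the SM whose
    remaining capacity it fills most tightly."""
    load = [0] * num_sms
    pending = []
    for b in sorted(blocks, reverse=True):
        fits = [i for i in range(num_sms) if load[i] + b <= capacity]
        if not fits:
            pending.append(b)
            continue
        best = min(fits, key=lambda i: capacity - (load[i] + b))
        load[best] += b
    return load, pending

for name, policy in [("Round-Robin", round_robin), ("Size-aware", size_aware)]:
    load, pending = policy(BLOCK_SIZES, NUM_SMS, SM_CAPACITY)
    unused = sum(SM_CAPACITY - l for l in load)
    print(f"{name:12s} loads={load} unused={unused} pending={pending}")
```

With these assumed sizes, Round-Robin leaves one large block pending and 1536 units of SM capacity unused in the wave, while the size-aware placement fills both SMs completely. The paper's actual algorithm additionally classifies incoming thread blocks into multiple per-size queues and weighs SM load balance, so that the placement decision can still be made instantly at dispatch time.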