{"title":"通用GPU系统的线程块调度算法分析","authors":"Soyeon Park, Kyungwoon Cho, H. Bahn","doi":"10.1109/CSDE53843.2021.9718419","DOIUrl":null,"url":null,"abstract":"Modern GPGPUs (General-Purpose Graphics Processing Units) have the ability of executing thousands of threads simultaneously. However, the resource utilization of GPGPU in real systems is limited as the load balancing between SMs (Stream Multiprocessors) is difficult during the scheduling of thread blocks, which are the basic units for resource allocation in GPGPU. In order to schedule thread blocks in GPGPU, the current hardware scheduler allocates thread blocks to SMs by the Round-Robin order. Although this is simple and easy to implement, we show that Round-Robin is not efficient when thread blocks of heterogeneous workloads are mixed. In such environments, efficient resource sharing in GPGPU is challenging as workloads have different resource usage patterns, but scheduling should be performed instantly. In this paper, we present a new thread block scheduling algorithm that has the ability of analyzing the load of SMs and the characteristics of pending thread blocks. Specifically, we formulate thread block scheduling as a bin-packing problem, and aim to minimize the internal fragmentation of SMs by arranging size-aware filling of thread blocks to overall SMs in advance. To do so, we make use of multiple queues for incoming thread blocks according to their sizes and perform scheduling by considering the load balancing of SMs. Our experimental results under a wide range of workload conditions show that the proposed algorithm improves the performance of GPGPU by 24.8% on average compared to the Round-Robin scheduler.","PeriodicalId":166950,"journal":{"name":"2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Analysis of Thread Block Scheduling Algorithms for General Purpose GPU Systems\",\"authors\":\"Soyeon Park, Kyungwoon Cho, H. Bahn\",\"doi\":\"10.1109/CSDE53843.2021.9718419\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Modern GPGPUs (General-Purpose Graphics Processing Units) have the ability of executing thousands of threads simultaneously. However, the resource utilization of GPGPU in real systems is limited as the load balancing between SMs (Stream Multiprocessors) is difficult during the scheduling of thread blocks, which are the basic units for resource allocation in GPGPU. In order to schedule thread blocks in GPGPU, the current hardware scheduler allocates thread blocks to SMs by the Round-Robin order. Although this is simple and easy to implement, we show that Round-Robin is not efficient when thread blocks of heterogeneous workloads are mixed. In such environments, efficient resource sharing in GPGPU is challenging as workloads have different resource usage patterns, but scheduling should be performed instantly. In this paper, we present a new thread block scheduling algorithm that has the ability of analyzing the load of SMs and the characteristics of pending thread blocks. Specifically, we formulate thread block scheduling as a bin-packing problem, and aim to minimize the internal fragmentation of SMs by arranging size-aware filling of thread blocks to overall SMs in advance. 
To do so, we make use of multiple queues for incoming thread blocks according to their sizes and perform scheduling by considering the load balancing of SMs. Our experimental results under a wide range of workload conditions show that the proposed algorithm improves the performance of GPGPU by 24.8% on average compared to the Round-Robin scheduler.\",\"PeriodicalId\":166950,\"journal\":{\"name\":\"2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE)\",\"volume\":\"36 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CSDE53843.2021.9718419\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CSDE53843.2021.9718419","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Analysis of Thread Block Scheduling Algorithms for General Purpose GPU Systems
Modern GPGPUs (General-Purpose Graphics Processing Units) can execute thousands of threads simultaneously. In real systems, however, GPGPU resource utilization is limited because load balancing across SMs (Streaming Multiprocessors) is difficult when scheduling thread blocks, the basic units of resource allocation in a GPGPU. The current hardware scheduler allocates thread blocks to SMs in Round-Robin order. Although this is simple and easy to implement, we show that Round-Robin is inefficient when thread blocks from heterogeneous workloads are mixed. In such environments, efficient resource sharing in a GPGPU is challenging because workloads have different resource usage patterns, yet scheduling decisions must be made instantly. In this paper, we present a new thread block scheduling algorithm that analyzes the load of each SM and the characteristics of pending thread blocks. Specifically, we formulate thread block scheduling as a bin-packing problem and aim to minimize the internal fragmentation of SMs by placing thread blocks onto SMs in a size-aware manner. To do so, we maintain multiple queues that classify incoming thread blocks by size and perform scheduling with the load balance of SMs in mind. Our experimental results under a wide range of workload conditions show that the proposed algorithm improves GPGPU performance by 24.8% on average compared to the Round-Robin scheduler.
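To make the contrast concrete, the following is a minimal sketch, not the paper's implementation, of how a strict Round-Robin dispatcher can strand SM capacity that a size-aware, bin-packing-style placement recovers within one scheduling wave. The capacity and block sizes (SM_CAPACITY, BLOCK_SIZES) and the best-fit heuristic used here are illustrative assumptions, not values or code from the paper.

```python
# Sketch: one dispatch wave under a strict Round-Robin policy vs. a
# size-aware (best-fit, bin-packing-style) policy. All numbers are
# illustrative assumptions, not measurements from the paper.

SM_CAPACITY = 2048                      # assumed per-SM resource budget
NUM_SMS = 2
BLOCK_SIZES = [1536, 512, 1536, 512]    # heterogeneous thread block sizes

def round_robin(blocks, num_sms, capacity):
    """Assign each block to the next SM in cyclic order; if that SM lacks
    room, the block stays pending for a later wave."""
    load = [0] * num_sms
    pending = []
    sm = 0
    for b in blocks:
        if load[sm] + b <= capacity:
            load[sm] += b
        else:
            pending.append(b)           # SM space elsewhere goes unused
        sm = (sm + 1) % num_sms
    return load, pending

def size_aware(blocks, num_sms, capacity):
    """Best-fit placement: largest blocks first, each onto the SM whose
    remaining capacity it fills most tightly."""
    load = [0] * num_sms
    pending = []
    for b in sorted(blocks, reverse=True):
        fits = [i for i in range(num_sms) if load[i] + b <= capacity]
        if not fits:
            pending.append(b)
            continue
        best = min(fits, key=lambda i: capacity - (load[i] + b))
        load[best] += b
    return load, pending

for name, policy in [("Round-Robin", round_robin), ("Size-aware", size_aware)]:
    load, pending = policy(BLOCK_SIZES, NUM_SMS, SM_CAPACITY)
    unused = sum(SM_CAPACITY - l for l in load)
    print(f"{name:12s} loads={load} unused={unused} pending={pending}")
```

With these assumed sizes, Round-Robin leaves one large block pending and 1536 units of SM capacity unused in the wave, while the size-aware placement fills both SMs completely. The paper's actual algorithm additionally classifies incoming thread blocks into multiple per-size queues and weighs SM load balance, so that the placement decision can still be made instantly at dispatch time.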