Preemptive thread block scheduling with online structural runtime prediction for concurrent GPGPU kernels

2014 23rd International Conference on Parallel Architecture and Compilation (PACT). Pub Date: 2014-06-23. DOI: 10.1145/2628071.2628117

Sreepathi Pai, R. Govindarajan, M. J. Thazhuthaveetil

Citations: 18

Abstract

Recent NVIDIA Graphics Processing Units (GPUs) can execute multiple kernels concurrently. On these GPUs, the thread block scheduler (TBS) currently uses a FIFO policy to schedule the thread blocks of concurrent kernels. We show that the FIFO policy leaves performance to chance, resulting in significant loss of performance and fairness. To improve both, we propose using the preemptive Shortest Remaining Time First (SRTF) policy instead. Although SRTF requires an estimate of the runtime of GPU kernels, we show that such an estimate can be obtained easily using online profiling and by exploiting a simple observation about the grid structure of GPU kernels. Specifically, we propose a novel Structural Runtime Predictor. Using a simple Staircase model of GPU kernel execution, we show that the runtime of a kernel can be predicted by profiling only the first few thread blocks. We evaluate an online predictor based on this model on benchmarks from ERCBench and find that it can estimate the actual runtime reasonably well after the execution of only a single thread block. Next, we design a thread block scheduler that is both concurrent-kernel-aware and uses this predictor. We implement the SRTF policy and evaluate it on two-program workloads from ERCBench. SRTF improves system throughput (STP) by 1.18× and average normalized turnaround time (ANTT) by 2.25× over FIFO. Compared to MPMax, a state-of-the-art resource allocation policy for concurrent kernels, SRTF improves STP by 1.16× and ANTT by 1.3×. To improve fairness, we also propose SRTF/Adaptive, which controls the resource usage of concurrently executing kernels to maximize fairness. SRTF/Adaptive improves STP by 1.12×, ANTT by 2.23×, and Fairness by 2.95× compared to FIFO. Overall, our implementation of SRTF achieves system throughput within 12.64% of Shortest Job First (SJF, an oracle optimal scheduling policy), bridging 49% of the gap between FIFO and SJF.
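The abstract does not spell out the Staircase model, so the following is only a minimal sketch under stated assumptions: thread blocks are taken to execute in waves of at most `resident_blocks` at a time, every block is taken to run for about the same sampled duration `block_time`, and remaining time is therefore approximated as the number of remaining waves times `block_time`. The class `KernelState` and the functions `predict_remaining` and `pick_next` are hypothetical names introduced for illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class KernelState:
    """Hypothetical bookkeeping for one concurrently running kernel."""
    name: str
    total_blocks: int     # thread blocks in the kernel's grid
    resident_blocks: int  # blocks assumed to run concurrently (one "wave")
    block_time: float     # duration sampled from an early thread block
    done_blocks: int = 0  # thread blocks that have already finished


def predict_remaining(k: KernelState) -> float:
    """Assumed staircase estimate: remaining waves times one block duration."""
    remaining = k.total_blocks - k.done_blocks
    waves = math.ceil(remaining / k.resident_blocks)
    return waves * k.block_time


def pick_next(kernels):
    """SRTF choice: the unfinished kernel with the smallest predicted
    remaining time, or None when every kernel has finished."""
    unfinished = [k for k in kernels if k.done_blocks < k.total_blocks]
    if not unfinished:
        return None
    return min(unfinished, key=predict_remaining)


if __name__ == "__main__":
    a = KernelState("A", total_blocks=120, resident_blocks=8, block_time=2.0)
    b = KernelState("B", total_blocks=16, resident_blocks=8, block_time=5.0)
    chosen = pick_next([a, b])
    print(chosen.name, predict_remaining(chosen))
```

In this sketch kernel B is preferred even though its individual thread blocks are slower, because its grid needs far fewer waves. A real scheduler would refresh `block_time` and `done_blocks` as blocks complete, which is what allows the choice to change while kernels are running.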
