Smart multi-task scheduling for OpenCL programs on CPU/GPU heterogeneous platforms

2014 21st International Conference on High Performance Computing (HiPC). Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116910

Y. Wen, Zheng Wang, M. O’Boyle


Citations: 134

Abstract

Heterogeneous systems consisting of multiple CPUs and GPUs are increasingly attractive as platforms for high-performance computing. Such platforms are usually programmed using OpenCL, which provides program portability by allowing the same program to execute on different types of devices. As such systems become more mainstream, they will move from application-dedicated devices to platforms that need to support multiple concurrent user applications. This creates a need to determine when and where to map different applications so as to best utilize the available heterogeneous hardware resources. In this paper, we present an efficient OpenCL task scheduling scheme that schedules multiple kernels from multiple programs on CPU/GPU heterogeneous platforms. It does this by determining at runtime which kernels are likely to best utilize a device. We show that speedup is a good scheduling priority function and develop a novel model that predicts a kernel's speedup based on its static code structure. Our scheduler uses this prediction and the runtime input data size to prioritize and schedule tasks. This technique is applied to a large set of concurrent OpenCL kernels. We evaluated our approach for system throughput and average turnaround time against competitive techniques on two different platforms: a Core i7/NVIDIA GTX590 platform and a Core i7/AMD Tahiti 7970 platform. For system throughput, we achieve, on average, a 1.21x and 1.25x improvement over the best competitors on the NVIDIA and AMD platforms respectively. Our approach reduces the turnaround time, on average, by at least 1.5x and 1.2x on the NVIDIA and AMD platforms respectively, when compared to alternative approaches.
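The idea of using predicted speedup as a scheduling priority can be illustrated with a minimal Python sketch. This is not the scheduler or the prediction model from the paper: the function name `schedule`, the task tuple layout, and the policy (the GPU takes the task with the highest predicted speedup, the CPU takes the one with the lowest) are all assumptions made for illustration, and the speedup values are supplied by hand rather than predicted from code structure and input size.

```python
from collections import deque


def schedule(tasks):
    """Greedy two-device simulation (hypothetical policy, not the paper's).

    tasks: list of (name, cpu_time, predicted_speedup), where
    predicted_speedup is an assumed GPU-over-CPU speedup estimate.
    Returns (assignments, makespan).
    """
    # Order tasks so the highest predicted GPU speedup comes first.
    queue = deque(sorted(tasks, key=lambda t: t[2], reverse=True))
    free_at = {"gpu": 0.0, "cpu": 0.0}
    assignments = []
    while queue:
        # The device that becomes idle first picks next; ties go to the GPU.
        device = "gpu" if free_at["gpu"] <= free_at["cpu"] else "cpu"
        if device == "gpu":
            # GPU takes the task that benefits most from it.
            name, cpu_time, speedup = queue.popleft()
            run_time = cpu_time / speedup
        else:
            # CPU takes the task that benefits least from the GPU.
            name, cpu_time, speedup = queue.pop()
            run_time = cpu_time
        start = free_at[device]
        free_at[device] = start + run_time
        assignments.append((name, device, start, free_at[device]))
    return assignments, max(free_at.values())


if __name__ == "__main__":
    demo = [("a", 8.0, 4.0), ("b", 6.0, 1.5), ("c", 4.0, 0.5)]
    plan, makespan = schedule(demo)
    for entry in plan:
        print(entry)
    print("makespan:", makespan)
```

In this toy setting, kernels with a large predicted speedup are steered to the GPU while kernels that gain little stay on the CPU, which is the intuition behind treating speedup as a priority function.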
