VirtCL: a framework for OpenCL device abstraction and management

Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Pub Date: 2015-01-24. DOI: 10.1145/2688500.2688505

Yi-Ping You, Hen-Jung Wu, Y. Tsai, Y. Chao


Citations: 38

Abstract

The interest in using multiple graphics processing units (GPUs) to accelerate applications has increased in recent years. However, the existing heterogeneous programming models (e.g., OpenCL) abstract details of GPU devices at the per-device level and require programmers to explicitly schedule their kernel tasks on a system equipped with multiple GPU devices. Unfortunately, multiple applications running on a multi-GPU system may compete for some of the GPU devices while leaving other GPU devices unused. Moreover, the distributed memory model defined in OpenCL, where each device has its own memory space, increases the complexity of managing the memory among multiple GPU devices. In this article we propose a framework (called VirtCL) that reduces the programming burden by acting as a layer between the programmer and the native OpenCL run-time system for abstracting multiple devices into a single virtual device and for scheduling computations and communications among the multiple devices. VirtCL comprises two main components: (1) a front-end library, which exposes primary OpenCL APIs and the virtual device, and (2) a back-end run-time system (called CLDaemon) for scheduling and dispatching kernel tasks based on a history-based scheduler. The front-end library forwards computation requests to the back-end CLDaemon, which then schedules and dispatches the requests. We also propose a history-based scheduler that is able to schedule kernel tasks in a contention- and communication-aware manner. Experiments demonstrated that the VirtCL framework introduced a small overhead (mean of 6%) but outperformed the native OpenCL run-time system for most benchmarks in the Rodinia benchmark suite, which was due to the abstraction layer eliminating the time-consuming initialization of OpenCL contexts. We also evaluated different scheduling policies in VirtCL with a real-world application (clsurf) and various synthetic workload traces. The results indicated that the VirtCL framework provides scalability for multiple kernel tasks running on multi-GPU systems.
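
The abstract does not spell out how the history-based scheduler chooses a device, so the following is only an illustrative sketch of what a contention- and communication-aware policy could look like, not the algorithm used in VirtCL or CLDaemon. It assumes that each device's pending work can be summarized by a single "busy until" timestamp (contention), that data movement costs a fixed amount per byte for buffers not yet resident on the device (communication), and that kernel run time is predicted as the mean of past observations (history). All names here (`Device`, `HistoryScheduler`, `transfer_cost_per_byte`, `default_runtime`) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical stand-in for one physical GPU behind the virtual device."""
    name: str
    busy_until: float = 0.0                     # when already-queued work is expected to finish
    resident: set = field(default_factory=set)  # names of buffers already in this device's memory


class HistoryScheduler:
    """Assumed policy: pick the device with the earliest estimated finish time."""

    def __init__(self, devices, transfer_cost_per_byte=1e-9, default_runtime=1.0):
        self.devices = list(devices)
        self.transfer_cost_per_byte = transfer_cost_per_byte
        self.default_runtime = default_runtime   # guess used before any history exists
        self.history = defaultdict(list)         # (kernel, device name) -> observed run times

    def record(self, kernel, device, runtime):
        """Store an observed run time so later predictions can use it."""
        self.history[(kernel, device.name)].append(runtime)

    def predict_runtime(self, kernel, device):
        """History term: mean of past run times of this kernel on this device."""
        samples = self.history[(kernel, device.name)]
        return sum(samples) / len(samples) if samples else self.default_runtime

    def transfer_time(self, buffers, device):
        """Communication term: cost of copying buffers the device does not yet hold."""
        missing = sum(size for name, size in buffers.items()
                      if name not in device.resident)
        return missing * self.transfer_cost_per_byte

    def estimate_finish(self, kernel, buffers, device, now):
        """Contention term (queue wait) plus communication plus predicted run time."""
        start = max(now, device.busy_until)
        return (start
                + self.transfer_time(buffers, device)
                + self.predict_runtime(kernel, device))

    def schedule(self, kernel, buffers, now=0.0):
        """Choose a device for the kernel and update that device's bookkeeping.

        `buffers` maps buffer names to sizes in bytes.
        """
        best = min(self.devices,
                   key=lambda d: self.estimate_finish(kernel, buffers, d, now))
        best.busy_until = self.estimate_finish(kernel, buffers, best, now)
        best.resident.update(buffers)            # buffers now live on the chosen device
        return best
```

Under these assumptions, a caller would build the scheduler from a list of `Device` objects, call `schedule(kernel, buffers, now)` for each incoming request, and call `record(...)` once the kernel's actual run time is known. A busy device is avoided unless it already holds the needed buffers and the saved transfer time outweighs the queueing delay.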
