Towards Deferred Execution of a SYCL Command Graph
Ewan W. Crawford, Pablo Reble, Ben Tracy, Julian Miller
Proceedings of the 2023 International Workshop on OpenCL, April 2023. DOI: 10.1145/3585341.3585375
Abstract

A key concept in SYCL’s execution model is the use of command groups, which create a directed acyclic graph of kernel executions at runtime. A command group object defines a set of dependencies, or edges, that must be satisfied before its kernels, or nodes, can execute. However, because command group submission is tied to execution on the queue, with no separate construction step before execution starts, optimization opportunities are missed: the runtime is never made aware of the full dependency graph ahead of execution. SYCL thus has a de facto built-in eager execution mode, in contrast to a lazy execution mode in which the definition and submission of work are decoupled.

We propose an extension to the SYCL 2020 specification [6] that closes this gap by introducing the concept of a command graph, together with new mechanisms for the user to build a command graph for later execution. Commands are added to a graph, the graph is finalized to prepare it for execution, and it is then executed on a queue. The extension reduces submission overhead by moving expensive operations and optimizations to finalize time and by allowing commands to be batched at submission time. Such command batching is supported by many SYCL backends but is not exposed to users through the SYCL API.

In addition to the benefits to the SYCL runtime, there are advantages for users developing SYCL applications. Repetitive workloads no longer need to redundantly issue the same sequence of commands. Instead, a graph is constructed once and submitted for execution as many times as necessary, changing only the data in input buffers or USM (Unified Shared Memory) allocations. This is particularly useful for applications in domains such as machine learning and computer vision, where the same command group pattern is run repeatedly on different inputs.
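To make the build–finalize–execute lifecycle concrete, the following is a minimal sketch of a graph built ahead of execution. It is written against the in-progress sycl_ext_oneapi_graph proposal; the experimental namespace, the command_graph constructor, property::node::depends_on, finalize(), and ext_oneapi_graph() are assumptions taken from that proposal and may differ from the final specification.

```cpp
#include <sycl/sycl.hpp>

namespace sycl_ext = sycl::ext::oneapi::experimental;

int main() {
  constexpr size_t N = 1024;
  sycl::queue q;

  // USM allocation; only its contents change between graph submissions.
  float *data = sycl::malloc_device<float>(N, q);

  // A modifiable graph, built ahead of any execution.
  sycl_ext::command_graph graph{q.get_context(), q.get_device()};

  // Add a node; nothing is submitted to the device yet.
  auto init = graph.add([&](sycl::handler &h) {
    h.parallel_for(sycl::range<1>{N},
                   [=](sycl::id<1> i) { data[i] = static_cast<float>(i[0]); });
  });

  // A second node with an explicit edge: it may only run after 'init'.
  graph.add(
      [&](sycl::handler &h) {
        h.parallel_for(sycl::range<1>{N},
                       [=](sycl::id<1> i) { data[i] *= 2.0f; });
      },
      {sycl_ext::property::node::depends_on(init)});

  // Finalize once: expensive preparation and optimization happen here.
  auto exec = graph.finalize();

  // Submit the same finalized graph as many times as needed.
  for (int iter = 0; iter < 100; ++iter)
    q.ext_oneapi_graph(exec);
  q.wait();

  sycl::free(data, q);
  return 0;
}
```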
This talk is presented in two sections. First, we provide an overview of the specification for the extension, which includes two distinct mechanisms for graph building: an explicit API, providing a new set of functions for expressing a command graph directly in SYCL code (as in the sketch above), and a “Record & Replay” API, designed to retrofit existing codebases and enable the use of existing libraries and frameworks with only minor modifications (sketched below).
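A minimal sketch of the Record & Replay mechanism, under the same assumptions as the previous example and reusing q, data, N, and the sycl_ext alias from it; begin_recording() and end_recording() are likewise names from the in-progress extension and may change.

```cpp
// Record & Replay: ordinary queue submissions are captured into the
// graph instead of executing eagerly, so existing code needs only
// minimal changes.
sycl_ext::command_graph graph{q.get_context(), q.get_device()};

graph.begin_recording(q);

// These submissions are recorded as graph nodes, not executed;
// the event dependency below becomes an edge in the graph.
sycl::event e = q.submit([&](sycl::handler &h) {
  h.parallel_for(sycl::range<1>{N},
                 [=](sycl::id<1> i) { data[i] = 1.0f; });
});
q.submit([&](sycl::handler &h) {
  h.depends_on(e);
  h.parallel_for(sycl::range<1>{N},
                 [=](sycl::id<1> i) { data[i] += 2.0f; });
});

graph.end_recording();

// Finalize and submit exactly as with the explicit API.
auto exec = graph.finalize();
q.ext_oneapi_graph(exec);
q.wait();
```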
We also discuss the mechanisms available for modifying a graph after construction, and the motivation for the API design compared to other similar mechanisms in use today in other programming models.

In the second section of our talk, we detail the work in progress on implementing the extension in Intel’s DPC++ runtime, in particular the early-stage prototype [3]. We show execution traces demonstrating the potential overhead reduction, as well as current limitations and what we have learned from the implementation so far. This includes an overview of how our implementation maps to the various available backends and how we address situations where there is no backend support. We also examine plans for the future of our proposal and implementation, and the optimization possibilities it enables, such as inter-node memory reuse and interactions with other relevant SYCL extensions.

RELATED WORK

CUDA Graphs supports deferred work submission in CUDA, enabling a work submission model similar to the one described in this work, but limited to NVIDIA GPUs [5] (sketched below). The OpenCL command-buffer extension [1] provides a mechanism to record a set of commands for repeated enqueuing; it can be seen as lower-level than the work presented in this abstract. Higher-level, vendor-independent approaches such as the C++ executors proposal P2300R5 [4] and Kokkos Graph [2] have implementations (at least at an experimental stage) that use CUDA Graphs as a backend to reduce latencies. The work described in our proposal would be a viable alternative as a vendor-independent backend, with benefits such as better maintainability.
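For comparison, a rough sketch of the CUDA Graphs stream-capture model that the Record & Replay API parallels. cudaMemsetAsync stands in for real kernel launches, and the five-argument cudaGraphInstantiate signature shown is the pre-CUDA 12 one.

```cpp
#include <cuda_runtime.h>

int main() {
  constexpr size_t N = 1024;
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  float *buf;
  cudaMalloc(&buf, N * sizeof(float));

  // Capture: work enqueued on the stream is recorded into a graph
  // instead of being executed.
  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  cudaMemsetAsync(buf, 0, N * sizeof(float), stream); // stand-in for kernels
  cudaStreamEndCapture(stream, &graph);

  // Instantiate once (analogous to finalize), then launch repeatedly.
  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
  for (int i = 0; i < 100; ++i)
    cudaGraphLaunch(exec, stream);
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaFree(buf);
  cudaStreamDestroy(stream);
  return 0;
}
```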