Towards Deferred Execution of a SYCL Command Graph
Ewan W. Crawford, Pablo Reble, Ben Tracy, Julian Miller
Proceedings of the 2023 International Workshop on OpenCL, April 2023. DOI: 10.1145/3585341.3585375
Abstract

A key concept in SYCL’s execution model is the use of command groups, which create a directed acyclic graph of kernel executions at runtime. A command group object defines a set of dependencies, or edges, that must be satisfied before its kernels, or nodes, can execute. However, because command group submission is tied to execution on the queue, with no separate construction step before execution starts, optimization opportunities are missed: the runtime is never made aware of the full dependency graph ahead of execution. SYCL thus has a de facto built-in eager execution mode, in contrast to a lazy execution mode in which the definition and submission of work are decoupled.

We propose an extension to the SYCL 2020 specification [6] that closes this gap by introducing the concept of a command graph, together with new mechanisms for the user to build a command graph for later execution. Commands are added to a graph, the graph is finalized to prepare it for execution, and it is then executed on a queue. The extension reduces submission overhead by moving expensive operations and optimizations to finalize time and by allowing commands to be batched at submission time. Such command batching is supported by many SYCL backends but is not exposed to users through the SYCL API.

In addition to the benefits to the SYCL runtime, there are advantages for users developing SYCL applications. Repetitive workloads no longer need to redundantly issue the same sequence of commands. Instead, a graph is constructed once and submitted for execution as many times as necessary, changing only the data in input buffers or USM (Unified Shared Memory) allocations. This is particularly useful for applications in domains such as machine learning and computer vision, where the same command group pattern is run repeatedly on different inputs.
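To make the build–finalize–execute lifecycle concrete, the following is a minimal sketch of a graph built ahead of execution. It is written against the in-progress sycl_ext_oneapi_graph proposal; the experimental namespace, the command_graph constructor, property::node::depends_on, finalize(), and ext_oneapi_graph() are assumptions taken from that proposal and may differ from the final specification.

```cpp
#include <sycl/sycl.hpp>

namespace sycl_ext = sycl::ext::oneapi::experimental;

int main() {
  constexpr size_t N = 1024;
  sycl::queue q;

  // USM allocation; only its contents change between graph submissions.
  float *data = sycl::malloc_device<float>(N, q);

  // A modifiable graph, built ahead of any execution.
  sycl_ext::command_graph graph{q.get_context(), q.get_device()};

  // Add a node; nothing is submitted to the device yet.
  auto init = graph.add([&](sycl::handler &h) {
    h.parallel_for(sycl::range<1>{N},
                   [=](sycl::id<1> i) { data[i] = static_cast<float>(i[0]); });
  });

  // A second node with an explicit edge: it may only run after 'init'.
  graph.add(
      [&](sycl::handler &h) {
        h.parallel_for(sycl::range<1>{N},
                       [=](sycl::id<1> i) { data[i] *= 2.0f; });
      },
      {sycl_ext::property::node::depends_on(init)});

  // Finalize once: expensive preparation and optimization happen here.
  auto exec = graph.finalize();

  // Submit the same finalized graph as many times as needed.
  for (int iter = 0; iter < 100; ++iter)
    q.ext_oneapi_graph(exec);
  q.wait();

  sycl::free(data, q);
  return 0;
}
```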
This talk is presented in two sections. First, we provide an overview of the specification for the extension, which includes two distinct mechanisms for graph building: an explicit API, providing a new set of functions for expressing a command graph directly in SYCL code (as in the sketch above), and a “Record & Replay” API, designed to retrofit existing codebases and enable the use of existing libraries and frameworks with only minor modifications (sketched below).
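A minimal sketch of the Record & Replay mechanism, under the same assumptions as the previous example and reusing q, data, N, and the sycl_ext alias from it; begin_recording() and end_recording() are likewise names from the in-progress extension and may change.

```cpp
// Record & Replay: ordinary queue submissions are captured into the
// graph instead of executing eagerly, so existing code needs only
// minimal changes.
sycl_ext::command_graph graph{q.get_context(), q.get_device()};

graph.begin_recording(q);

// These submissions are recorded as graph nodes, not executed;
// the event dependency below becomes an edge in the graph.
sycl::event e = q.submit([&](sycl::handler &h) {
  h.parallel_for(sycl::range<1>{N},
                 [=](sycl::id<1> i) { data[i] = 1.0f; });
});
q.submit([&](sycl::handler &h) {
  h.depends_on(e);
  h.parallel_for(sycl::range<1>{N},
                 [=](sycl::id<1> i) { data[i] += 2.0f; });
});

graph.end_recording();

// Finalize and submit exactly as with the explicit API.
auto exec = graph.finalize();
q.ext_oneapi_graph(exec);
q.wait();
```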
We also discuss the mechanisms available for modifying a graph after construction, and the motivation for the API design compared to other similar mechanisms in use today in other programming models.

In the second section of our talk, we detail the work in progress on implementing the extension in Intel’s DPC++ runtime, in particular the early-stage prototype [3]. We show execution traces demonstrating the potential overhead reduction, as well as current limitations and what we have learned from the implementation so far. This includes an overview of how our implementation maps to the various available backends and how we address situations where there is no backend support. We also examine plans for the future of our proposal and implementation, and the optimization possibilities it enables, such as inter-node memory reuse and interactions with other relevant SYCL extensions.

RELATED WORK

CUDA Graphs supports deferred work submission in CUDA, enabling a work submission model similar to the one described in this work, but limited to NVIDIA GPUs [5] (sketched below). The OpenCL command-buffer extension [1] provides a mechanism to record a set of commands for repeated enqueuing; it can be seen as lower-level than the work presented in this abstract. Higher-level, vendor-independent approaches such as the C++ executors proposal P2300R5 [4] and Kokkos Graph [2] have implementations (at least at an experimental stage) that use CUDA Graphs as a backend to reduce latencies. The work described in our proposal would be a viable alternative as a vendor-independent backend, with benefits such as better maintainability.
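For comparison, a rough sketch of the CUDA Graphs stream-capture model that the Record & Replay API parallels. cudaMemsetAsync stands in for real kernel launches, and the five-argument cudaGraphInstantiate signature shown is the pre-CUDA 12 one.

```cpp
#include <cuda_runtime.h>

int main() {
  constexpr size_t N = 1024;
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  float *buf;
  cudaMalloc(&buf, N * sizeof(float));

  // Capture: work enqueued on the stream is recorded into a graph
  // instead of being executed.
  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  cudaMemsetAsync(buf, 0, N * sizeof(float), stream); // stand-in for kernels
  cudaStreamEndCapture(stream, &graph);

  // Instantiate once (analogous to finalize), then launch repeatedly.
  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
  for (int i = 0; i < 100; ++i)
    cudaGraphLaunch(exec, stream);
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaFree(buf);
  cudaStreamDestroy(stream);
  return 0;
}
```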