Improving Application Concurrency on GPUs by Managing Implicit and Explicit Synchronizations

M. Butler, Kittisak Sajjapongse, M. Becchi
{"title":"Improving Application Concurrency on GPUs by Managing Implicit and Explicit Synchronizations","authors":"M. Butler, Kittisak Sajjapongse, M. Becchi","doi":"10.1109/ICPADS.2015.73","DOIUrl":null,"url":null,"abstract":"Originally designed to be used as dedicated coprocessors, GPUs have progressively become part of shared computing environments, such as HPC servers and clusters. Commonly used GPU software stacks (e.g., CUDA and OpenCL), however, are designed for the dedicated use of GPUs by a single application, possibly leading to resource underutilization when multiple applications share the GPU resources. In recent years, several node-level runtime components have been proposed to target this problem and allow the efficient sharing of GPUs among concurrent applications. The concurrency enabled by these systems, however, is limited by synchronizations embedded in the applications or implicitly introduced by the GPU software stack. This work targets this problem. We first analyze the effect of explicit and implicit synchronizations on application concurrency and GPU utilization. We then design runtime mechanisms to bypass these synchronizations, along with a memory management scheme that can be integrated with these synchronization avoidance mechanisms to improve GPU utilization and system throughput. We integrate these mechanisms into a recently proposed GPU virtualization runtime named Sync-Free GPU (SF-GPU), thus removing unnecessary blockages caused by multitenancy, ensuring any two applications running on the same device experience limited to no interference, maximizing the level of concurrency supported. We also release our mechanisms in the form of a software API that can be used by programmers to improve the performance of their applications without modifying their code. 
Finally, we evaluate the impact of our proposed mechanisms on applications run in isolation and concurrently.","PeriodicalId":231517,"journal":{"name":"2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPADS.2015.73","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2

Abstract

Originally designed to be used as dedicated coprocessors, GPUs have progressively become part of shared computing environments, such as HPC servers and clusters. Commonly used GPU software stacks (e.g., CUDA and OpenCL), however, are designed for the dedicated use of a GPU by a single application, possibly leading to resource underutilization when multiple applications share GPU resources. In recent years, several node-level runtime components have been proposed to target this problem and allow the efficient sharing of GPUs among concurrent applications. The concurrency enabled by these systems, however, is limited by synchronizations embedded in the applications or implicitly introduced by the GPU software stack. This work targets this problem. We first analyze the effect of explicit and implicit synchronizations on application concurrency and GPU utilization. We then design runtime mechanisms to bypass these synchronizations, along with a memory management scheme that can be integrated with these synchronization-avoidance mechanisms to improve GPU utilization and system throughput. We integrate these mechanisms into a recently proposed GPU virtualization runtime named Sync-Free GPU (SF-GPU), removing unnecessary blockages caused by multi-tenancy and ensuring that any two applications running on the same device experience little to no interference, thereby maximizing the level of concurrency supported. We also release our mechanisms in the form of a software API that programmers can use to improve the performance of their applications without modifying their code. Finally, we evaluate the impact of our proposed mechanisms on applications run both in isolation and concurrently.
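To make the distinction concrete, the sketch below (not the paper's SF-GPU mechanism; the kernel, buffer sizes, and names are hypothetical) contrasts the implicit synchronization introduced by the stock CUDA stack with the standard stream-based API that avoids it. A pageable-memory `cudaMemcpy` on the default stream blocks the host and serializes against other device work; pinned memory plus `cudaMemcpyAsync` on a non-default stream lets transfers and kernels from different applications or contexts overlap, with synchronization made explicit and scoped to a single stream.

```cuda
// Illustrative sketch: implicit vs. explicit synchronization in stock CUDA.
// Assumes a CUDA-capable device; kernel and sizes are made up for the example.
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_buf, *d_buf;

    // A plain cudaMemcpy from pageable memory on the default stream would
    // block the host and implicitly synchronize the device -- the kind of
    // hidden serialization that limits concurrency under GPU sharing.
    // Pinned memory + cudaMemcpyAsync on a dedicated stream avoids it.
    cudaMallocHost(&h_buf, n * sizeof(float));   // pinned host allocation
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);                   // work queued on this stream
                                                 // can overlap other streams

    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_buf, d_buf, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);               // explicit synchronization,
                                                 // scoped to one stream only
    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```

Note that even the stream-scoped `cudaStreamSynchronize` at the end is an explicit synchronization; the paper's contribution is runtime machinery that bypasses such points transparently when applications share a device.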