Moka: Model-based concurrent kernel analysis

2017 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2017-10-01 DOI:10.1109/IISWC.2017.8167777

Leiming Yu, Xun Gong, Yifan Sun, Q. Fang, Norman Rubin, D. Kaeli


Citations: 2

Abstract

GPUs continue to add compute resources with each new generation. Many data-parallel applications have been re-engineered to leverage the thousands of cores on the GPU, but not every kernel can fully utilize all of the available resources. Many applications contain multiple kernels that could potentially run concurrently. To better utilize the massive resources on the GPU, device vendors have started to support Concurrent Kernel Execution (CKE). However, the application throughput provided by CKE depends on a number of factors, including the kernel configuration attributes, the dynamic behavior of each kernel (e.g., compute-intensive vs. memory-intensive), the kernel launch order, and inter-kernel dependencies. Minor changes in any of these factors can have a large impact on the effectiveness of CKE. In this paper, we present Moka, an empirical model for tuning concurrent kernel performance. Moka allows us to accurately predict the resulting performance and scalability of multi-kernel applications when using CKE. We consider both static and dynamic workload characteristics that impact the utility of CKE, and leverage these metrics to drive kernel scheduling decisions on NVIDIA GPUs. The underlying data transfer pattern and GPU resource contention are analyzed in detail. Our model is able to accurately predict the performance ceiling of concurrent kernel execution. We validate our model using several real-world applications that contain multiple kernels that can run concurrently, and evaluate CKE performance on an NVIDIA Maxwell GPU. Our model's estimates differ from the actual runtime performance by less than 12%. Using these estimates, we can quickly find the best CKE strategy for an application and so improve its throughput. We believe we have developed a useful tool to help application programmers accelerate their applications using CKE.
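The last two claims in the abstract, choosing a CKE strategy from model estimates and judging those estimates against measured runtime, can be illustrated with a minimal sketch. This is not Moka's model: the function names, the strategy labels, and all runtimes below are hypothetical placeholders chosen for illustration, and the relative-error definition is an assumption about how a figure such as "less than 12%" would typically be computed.

```python
from typing import Dict


def relative_error(predicted_ms: float, measured_ms: float) -> float:
    """Absolute prediction error as a fraction of the measured runtime."""
    if measured_ms <= 0:
        raise ValueError("measured_ms must be positive")
    return abs(predicted_ms - measured_ms) / measured_ms


def best_strategy(predicted_ms: Dict[str, float]) -> str:
    """Return the name of the strategy with the lowest predicted runtime."""
    if not predicted_ms:
        raise ValueError("no strategies given")
    return min(predicted_ms, key=predicted_ms.get)


# Made-up placeholder runtimes in milliseconds; these are NOT results from the paper.
predicted = {"serial": 10.0, "two_streams": 7.0, "four_streams": 6.0}
measured = {"serial": 10.5, "two_streams": 7.5, "four_streams": 6.25}

# Pick the strategy the (hypothetical) estimates favor, then check how far
# that estimate is from the (hypothetical) measurement.
choice = best_strategy(predicted)
error = relative_error(predicted[choice], measured[choice])
print(choice, f"{error:.1%}")
```

The point of the sketch is the workflow: a strategy is selected from predictions alone, and measurement is only needed afterwards to confirm that the estimate for the chosen strategy stayed within the expected error bound.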

