Symbiotic scheduling of concurrent GPU kernels for performance and energy optimizations

Proceedings of the 11th ACM Conference on Computing Frontiers. Pub Date: 2014-05-20. DOI: 10.1145/2597917.2597925

Teng Li, Vikram K. Narayana, T. El-Ghazawi


Citations: 17

Abstract



The incorporation of GPUs as co-processors has brought significant performance improvements to High-Performance Computing (HPC). Efficient utilization of GPU resources is thus an important consideration for computer scientists. To obtain the required performance while limiting energy consumption, researchers and vendors alike are seeking to apply traditional CPU approaches to the GPU computing domain. For instance, newer NVIDIA GPUs now support concurrent execution of independent kernels as well as Dynamic Voltage and Frequency Scaling (DVFS). These developments open new opportunities for efficiently scheduling GPU computational kernels under performance and energy constraints. In this paper, we carry out performance and energy optimizations geared towards the execution phases of concurrent kernels in GPU-based computing. When multiple GPU kernels are enqueued for concurrent execution, the sequence in which they are initiated can significantly affect the total execution time and the energy consumption. We attribute this behavior to the relative synergy among kernels that are launched in close proximity to each other. Accordingly, we define metrics for computing the extent to which kernels are symbiotic, by modeling their complementary resource requirements and execution characteristics. We then propose a symbiotic scheduling algorithm to obtain the best possible kernel launch sequence for concurrent execution. Experimental results on the latest NVIDIA K20 GPU demonstrate the efficacy of our proposed algorithm-based approach, showing near-optimal results within the solution space of both performance and energy consumption. As our further experimental study on DVFS finds that increasing the GPU frequency in general leads to improved performance and energy savings, the proposed approach reduces the necessity for over-clocking and can be readily adopted by programmers with minimal programming effort and risk.
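The abstract does not give the symbiosis metrics or the scheduling algorithm themselves, so the following is only a minimal sketch of the general idea of ordering kernel launches by complementary resource demand. Everything in it is an assumption made for illustration: the `Kernel` profile fields, the toy `symbiosis` score, and the greedy `launch_order` heuristic are hypothetical and are not the authors' definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    """Hypothetical per-kernel profile (illustrative fields, not from the paper)."""
    name: str
    compute: float  # assumed fraction of compute throughput used when run alone (0..1)
    memory: float   # assumed fraction of memory bandwidth used when run alone (0..1)


def symbiosis(a: Kernel, b: Kernel) -> float:
    """Toy score in [0, 1]: higher when the combined demand of a and b
    oversubscribes neither resource. An assumed stand-in, not the paper's metric."""
    over_compute = max(0.0, a.compute + b.compute - 1.0)
    over_memory = max(0.0, a.memory + b.memory - 1.0)
    return 1.0 - (over_compute + over_memory) / 2.0


def launch_order(kernels: list[Kernel]) -> list[Kernel]:
    """Greedy heuristic: start with the most demanding kernel, then repeatedly
    append the remaining kernel that is most symbiotic with the last one chosen."""
    if not kernels:
        return []
    remaining = list(kernels)
    first = max(remaining, key=lambda k: k.compute + k.memory)
    remaining.remove(first)
    order = [first]
    while remaining:
        best = max(remaining, key=lambda k: symbiosis(order[-1], k))
        remaining.remove(best)
        order.append(best)
    return order
```

Under these assumptions, a compute-heavy kernel tends to be followed by a memory-heavy one, so neighbors in the launch sequence compete less for the same resource. In a real program the resulting order would decide the sequence in which kernels are issued for concurrent execution; the profile numbers would have to come from measurement rather than guesses.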
