Energy Efficient Task Graph Execution Using Compute Unit Masking in GPUs

2021 IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop (RSDHA) Pub Date : 2021-11-01 DOI:10.1109/rsdha54838.2021.00011

M. Chow, K. Ranganath, R. Lerias, Mika Shanela Carodan, Daniel Wong


Citations: 2

Abstract

Novel discrete accelerators are pushing the frontiers of supercomputing. Accelerators such as GPUs are employed to speed up machine learning, scientific, and high-performance computing applications. However, it has become harder to extract additional parallelism from traditional workloads, which is why task graphs have received more attention. AMD's Directed Acyclic Graph Execution Engine (DAGEE) allows the programmer to define a workload as fine-grained tasks, while the system handles the dependencies at a lower level. We evaluate DAGEE with the Winograd-Strassen matrix multiplication algorithm and show that DAGEE achieves a 15.3% average speedup over the traditional matrix multiplication algorithm. Using DAGEE, however, may increase contention among kernels because of the increased parallelism. AMD allows the programmer to set the number of active Compute Units (CUs) by masking. This fine-grained scaling allows the system software to enable only the required number of CUs within a GPU. Using this mechanism, we develop a runtime that masks CUs for each task during task graph execution and partitions the tasks onto separate CUs, reducing overall contention and energy consumption. We show that our CU masking runtime reduces energy by 18% on average.
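To make the partitioning idea concrete, the following is a minimal sketch of how per-task CU masks could be built so that concurrently running tasks land on disjoint sets of CUs. It is not the authors' runtime: the function names `partition_cu_masks` and `to_words`, the contiguous allocation policy, the 60-CU device size, and the per-task CU counts are all assumptions made for illustration. The 32-bit word layout is also an assumption, chosen because HIP's `hipExtStreamCreateWithCUMask` takes a CU mask as an array of 32-bit words; the abstract does not say which interface the runtime uses.

```python
from typing import List

WORD_BITS = 32  # assumption: the mask is handed to the GPU runtime as 32-bit words


def partition_cu_masks(total_cus: int, shares: List[int]) -> List[int]:
    """Assign each task a contiguous, non-overlapping range of CUs.

    shares[i] is the number of CUs requested for task i (hypothetical input).
    Returns one integer bitmask per task; bit k set means CU k is enabled.
    """
    if any(s <= 0 for s in shares):
        raise ValueError("every task needs at least one CU")
    if sum(shares) > total_cus:
        raise ValueError("requested more CUs than the device has")
    masks = []
    first = 0  # index of the first CU still unassigned
    for s in shares:
        masks.append(((1 << s) - 1) << first)  # s consecutive bits starting at `first`
        first += s
    return masks


def to_words(mask: int, total_cus: int) -> List[int]:
    """Split an integer bitmask into 32-bit words, least significant word first."""
    n_words = (total_cus + WORD_BITS - 1) // WORD_BITS
    return [(mask >> (WORD_BITS * i)) & 0xFFFFFFFF for i in range(n_words)]


if __name__ == "__main__":
    # Hypothetical 60-CU device shared by three concurrent tasks.
    for task_id, m in enumerate(partition_cu_masks(60, [30, 20, 10])):
        print(task_id, [f"{w:#010x}" for w in to_words(m, 60)])
```

Because the masks are disjoint, kernels from different tasks do not compete for the same CUs, and any CUs left out of every mask stay idle, which is the lever the abstract associates with lower contention and energy use. A real runtime would also have to decide how many CUs each task should receive; this sketch takes those counts as given.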


