Optimizing Data Transfers for Improved Performance on Shared GPUs Using Reinforcement Learning

2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) Pub Date : 2018-05-01 DOI:10.1109/CCGRID.2018.00061

R. Luley, Qinru Qiu

{"title":"Optimizing Data Transfers for Improved Performance on Shared GPUs Using Reinforcement Learning","authors":"R. Luley, Qinru Qiu","doi":"10.1109/CCGRID.2018.00061","DOIUrl":null,"url":null,"abstract":"Optimizing resource utilization is a critical issue in cloud and cluster-based computing systems. In such systems, computing resources often consist of one or more GPU devices, and much research has already been conducted on means for maximizing compute resources through shared execution strategies. However, one of the most severe resource constraints in these scenarios is the data transfer channel between the host (i.e., CPU) and the device (i.e., GPU). Data transfer contention has been shown to have a significant impact on performance, yet methods for optimizing such contention have not been thoroughly studied. Techniques that have been examined make certain assumptions which limit effectiveness in the general case. In this paper, we introduce a heuristic which selectively aggregates transfers in order to maximize system performance by optimizing the transfer channel bandwidth. We compare this heuristic to traditional first-come-first-served approach, and apply Monte Carlo reinforcement learning to find an optimal policy for message aggregation. Finally, we evaluate the performance of Monte Carlo reinforcement learning with an arbitrarily-initialized policy. We demonstrate its effectiveness in learning optimal data transfer policy without detailed system characterization, which will enable a general adaptable solution for resource management of future systems.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"137 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2018.00061","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

Abstract

Optimizing resource utilization is a critical issue in cloud and cluster-based computing systems. In such systems, computing resources often consist of one or more GPU devices, and much research has already been conducted on means for maximizing compute resources through shared execution strategies. However, one of the most severe resource constraints in these scenarios is the data transfer channel between the host (i.e., CPU) and the device (i.e., GPU). Data transfer contention has been shown to have a significant impact on performance, yet methods for optimizing such contention have not been thoroughly studied. Techniques that have been examined make certain assumptions which limit effectiveness in the general case. In this paper, we introduce a heuristic which selectively aggregates transfers in order to maximize system performance by optimizing the transfer channel bandwidth. We compare this heuristic to traditional first-come-first-served approach, and apply Monte Carlo reinforcement learning to find an optimal policy for message aggregation. Finally, we evaluate the performance of Monte Carlo reinforcement learning with an arbitrarily-initialized policy. We demonstrate its effectiveness in learning optimal data transfer policy without detailed system characterization, which will enable a general adaptable solution for resource management of future systems.

查看原文本刊更多论文

使用强化学习优化共享gpu上的数据传输以提高性能

在基于云和集群的计算系统中，优化资源利用是一个关键问题。在这样的系统中，计算资源通常由一个或多个GPU设备组成，并且已经对通过共享执行策略最大化计算资源的方法进行了大量研究。然而，在这些场景中最严重的资源限制之一是主机(即CPU)和设备(即GPU)之间的数据传输通道。数据传输争用已被证明对性能有重大影响，但优化这种争用的方法尚未得到深入研究。所研究的技术都有一定的假设，这些假设限制了一般情况下的有效性。本文引入了一种启发式算法，通过优化传输信道带宽，选择性地聚合传输，从而使系统性能最大化。我们将这种启发式方法与传统的先到先得方法进行比较，并应用蒙特卡罗强化学习来寻找消息聚合的最佳策略。最后，我们评估了随机初始化策略下蒙特卡罗强化学习的性能。我们证明了它在学习最佳数据传输策略方面的有效性，而不需要详细的系统特征，这将为未来系统的资源管理提供一个通用的适应性解决方案。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)

自引率

0.00%

发文量