Demystifying GPU UVM Cost with Deep Runtime and Workload Analysis

Tyler N. Allen, Rong Ge
DOI: 10.1109/IPDPS49936.2021.00023
Venue: 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Publication date: 2021-05-01
Citations: 7

Abstract

With GPUs becoming ubiquitous in HPC systems, NVIDIA’s Unified Virtual Memory (UVM) is being adopted to simplify porting complex codes to GPU platforms by allowing demand paging between host and device memory without explicit programmer management. Much like its storage-based counterparts, UVM trades performance for usability: its abstraction and fault-handling mechanisms impose overhead that prevents HPC systems from being used efficiently and effectively, and decreases the overall value of GPU-based systems.

To mitigate the cost of page fault stall time, NVIDIA has introduced a prefetching mechanism into its UVM system. This prefetcher speculates on data needs ahead of time based on prior page fault history, aiming to satisfy faults before they occur. Such a prefetcher must be cleverly designed and efficient, as it operates under the constraints of a real-time system to provide effective service. The workload is also quite complex due to the parallel nature of GPU faults, as well as page fault serialization and fault source erasure within the driver. The current prefetching mechanism uses a density-prefetching algorithm to offset the side effects of receiving page faults in parallel. While this prefetching can be very effective, it also degrades performance under GPU oversubscription.

In this paper, we provide a deep analysis of the overhead caused by UVM and its primary sources. Additionally, we analyze the impact of NVIDIA’s prefetching and oversubscription in practice on different workloads, and correlate the observed performance with the driver implementation and prefetching mechanism. We provide design insights and improvement suggestions for hardware and middleware that would open new avenues for performance gains.
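The demand-paging and prefetching behavior described above is triggered by how an application allocates memory. A minimal CUDA sketch (illustrative only, not code from the paper; array size and kernel are placeholders) showing a managed allocation, whose pages fault into device memory on first GPU touch and exercise the driver's fault-handling and density-prefetching path, and the explicit `cudaMemPrefetchAsync` hint that migrates pages up front, sidestepping the fault path the paper analyzes:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // first GPU touch faults the page to the device
}

int main() {
    const int n = 1 << 20;
    float *data;

    // Managed allocation: pages migrate on demand between host and device,
    // serviced by the UVM driver's fault-handling path.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // pages populated on the host

    // Optional explicit hint: bulk-migrating pages before the launch avoids
    // the fault-driven prefetcher entirely for this first kernel.
    int dev = 0;
    cudaMemPrefetchAsync(data, n * sizeof(float), dev);

    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);   // host read faults the page back
    cudaFree(data);
    return 0;
}
```

Omitting the `cudaMemPrefetchAsync` call leaves the first kernel launch dependent on the fault-and-prefetch machinery whose overhead the paper measures.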