PUMP: Profiling-free Unified Memory Prefetcher for Large DNN Model Support

2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC). Pub Date: 2022-01-17. DOI: 10.1109/asp-dac52403.2022.9712507

Chung-Hsiang Lin, Shaoyu Lin, Yi-Jung Chen, En-Yu Jenp, Chia-Lin Yang


Abstract

Modern DNNs are growing deeper and wider to achieve higher accuracy. However, existing deep learning frameworks require the whole DNN model to fit in GPU memory when training on GPUs, which limits the size of the models that can be trained. NVIDIA Unified Memory (UM) can inherently support training DNN models that exceed GPU memory capacity, but naively adopting UM incurs a significant performance penalty due to data transfer delays. In this paper, we propose PUMP, a Profiling-free Unified Memory Prefetcher. PUMP exploits asynchronous GPU execution for prefetching: there is a delay between the time the CPU launches a kernel and the time the kernel executes on the GPU. At launch time, PUMP extracts the memory blocks the kernel will access and swaps these blocks into GPU memory. Experimental results show that PUMP achieves about a 2x speedup on average compared to a baseline that naively enables UM.
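The overlap the abstract describes can be illustrated with a toy timeline model. The sketch below is not PUMP's implementation and uses no CUDA API. It only compares two schedules under simplifying assumptions: a single host-to-device link, no eviction, and transfer costs that are identical whether blocks arrive by page fault or by prefetch. The `Kernel` fields, both function names, and all timing values are hypothetical and chosen for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    launch_time: float    # when the CPU enqueues the kernel (ms)
    compute_time: float   # GPU execution time once data is resident (ms)
    transfer_time: float  # time to move the kernel's blocks into GPU memory (ms)


def finish_time_on_demand(kernels):
    """Naive UM (toy model): blocks migrate while the kernel runs,
    so transfer time is serialized with compute time."""
    gpu_free = 0.0
    for k in kernels:
        start = max(gpu_free, k.launch_time)
        gpu_free = start + k.transfer_time + k.compute_time
    return gpu_free


def finish_time_launch_prefetch(kernels):
    """Prefetch at launch (toy model): migration starts when the CPU
    launches the kernel and overlaps with earlier kernels still running."""
    gpu_free = 0.0
    link_free = 0.0  # one host-to-device link; transfers are serialized
    for k in kernels:
        transfer_start = max(link_free, k.launch_time)
        link_free = transfer_start + k.transfer_time
        # The kernel runs once the GPU is idle and its blocks are resident.
        start = max(gpu_free, k.launch_time, link_free)
        gpu_free = start + k.compute_time
    return gpu_free


# Hypothetical workload: the CPU launches four kernels 1 ms apart,
# far faster than the GPU can execute them.
kernels = [Kernel(launch_time=float(i), compute_time=10.0, transfer_time=8.0)
           for i in range(4)]

print(finish_time_on_demand(kernels))        # 72.0
print(finish_time_launch_prefetch(kernels))  # 48.0
```

In this model only the first kernel pays its full transfer cost, because every later transfer is hidden behind the preceding kernel's compute time. The resulting ratio is a property of the made-up inputs and says nothing about the speedup reported in the paper.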


