PUMP: Profiling-free Unified Memory Prefetcher for Large DNN Model Support

Chung-Hsiang Lin, Shaoyu Lin, Yi-Jung Chen, En-Yu Jenp, Chia-Lin Yang
{"title":"PUMP: Profiling-free Unified Memory Prefetcher for Large DNN Model Support","authors":"Chung-Hsiang Lin, Shaoyu Lin, Yi-Jung Chen, En-Yu Jenp, Chia-Lin Yang","doi":"10.1109/asp-dac52403.2022.9712507","DOIUrl":null,"url":null,"abstract":"Modern DNNs are going deeper and wider to achieve higher accuracy. However, existing deep learning frameworks require the whole DNN model to fit into the GPU memory when training with GPUs, which puts an unwanted limitation on training large models. Utilizing NVIDIA Unified Memory (UM) could inherently support training DNN models beyond GPU memory capacity. However, naively adopting UM would suffer a significant performance penalty due to the delay of data transfer. In this paper, we propose PUMP, a Profiling-free Unified Memory Prefetcher. PUMP exploits GPU asynchronous execution for prefetch; that is, there exists a delay between the time that CPU launches a kernel and the time the kernel executes in GPU. PUMP extracts memory blocks accessed by the kernel when launching and swaps these blocks into GPU memory. Experimental results show PUMP achieves about 2x speedup on the average compared to the baseline that naively enables UM.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/asp-dac52403.2022.9712507","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Modern DNNs are growing deeper and wider to achieve higher accuracy. However, existing deep learning frameworks require the whole DNN model to fit into GPU memory when training on GPUs, which places an unwanted limit on the size of trainable models. NVIDIA Unified Memory (UM) inherently supports training DNN models beyond GPU memory capacity, but naively adopting UM incurs a significant performance penalty due to data-transfer delays. In this paper, we propose PUMP, a Profiling-free Unified Memory Prefetcher. PUMP exploits GPU asynchronous execution for prefetching: there is a delay between the time the CPU launches a kernel and the time the kernel actually executes on the GPU. At launch time, PUMP extracts the memory blocks the kernel will access and swaps these blocks into GPU memory ahead of execution. Experimental results show that PUMP achieves about a 2x speedup on average over a baseline that naively enables UM.
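The prefetch window PUMP exploits comes from CUDA's asynchronous launch semantics: the host thread returns from a kernel launch immediately, so a migration of the next kernel's data can overlap with GPU work already in flight. Below is a minimal sketch of that idea using the standard Unified Memory APIs (cudaMallocManaged, cudaMemPrefetchAsync). This is not the authors' implementation; the kernel, buffer name, and sizes are hypothetical, and the prefetch is issued by hand where PUMP would issue it automatically from the kernel's launch arguments.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for one DNN layer's computation.
__global__ void scale_kernel(float *data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 24;  // hypothetical buffer size (~64 MB of floats)
    float *data;

    // UM allocation: the buffer may exceed GPU memory and migrates on demand.
    cudaMallocManaged(&data, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // touched on the CPU first

    int device;
    cudaGetDevice(&device);

    // Queue a migration of the block the next kernel will touch. The launch
    // below returns immediately on the host, so this transfer can overlap
    // with earlier GPU work instead of stalling the kernel on page faults --
    // the launch-to-execution gap PUMP exploits.
    cudaMemPrefetchAsync(data, n * sizeof(float), device);

    scale_kernel<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
    cudaDeviceSynchronize();

    printf("first element: %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

PUMP automates this step without offline profiling: when a kernel is launched, it identifies the managed memory blocks the kernel will access and issues the prefetches itself, so the application code needs no manual prefetch calls.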