Building a Performance Model for Deep Learning Recommendation Model Training on GPUs

Zhongyi Lin, Louis Feng, E. K. Ardestani, Jaewon Lee, J. Lundell, Changkyu Kim, A. Kejariwal, John Douglas Owens
{"title":"Building a Performance Model for Deep Learning Recommendation Model Training on GPUs","authors":"Zhongyi Lin, Louis Feng, E. K. Ardestani, Jaewon Lee, J. Lundell, Changkyu Kim, A. Kejariwal, John Douglas Owens","doi":"10.1109/HiPC56025.2022.00019","DOIUrl":null,"url":null,"abstract":"We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) but also the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for operators that dominate the device active time, and (2) categorizing operator overheads into five types to determine quantitatively their contribution to the device active time. Combining these two parts, we propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) in all kernel performance modeling, and 4.61% and 7.96% geomean errors for GPU active time and overall E2E per-batch training time prediction with overheads from individual workloads, respectively. A slight increase of 2.19% incurred in E2E prediction error with shared overheads across workloads suggests the feasibility of using shared overheads in large-scale prediction. We show that our general performance model not only achieves low prediction error on DLRM, which has highly customized configurations and is dominated by multiple factors but also yields comparable accuracy on other compute-bound ML models targeted by most previous methods. Using this performance model and graph-level data and task dependency analysis, we show our system can provide more general model-system co-design than previous methods.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"129 11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HiPC56025.2022.00019","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that the device active time (the sum of kernel runtimes) and the device idle time are both important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for operators that dominate the device active time, and (2) categorizing operator overheads into five types to determine quantitatively their contribution to the device idle time. Combining these two parts, we propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) in all kernel performance modeling, and 4.61% and 7.96% geomean errors for GPU active time and overall E2E per-batch training time prediction with overheads from individual workloads, respectively. A slight increase of 2.19% in E2E prediction error when overheads are shared across workloads suggests the feasibility of using shared overheads in large-scale prediction. We show that our general performance model not only achieves low prediction error on DLRM, which has highly customized configurations and is dominated by multiple factors, but also yields comparable accuracy on other compute-bound ML models targeted by most previous methods. Using this performance model and graph-level data and task dependency analysis, we show our system can provide more general model-system co-design than previous methods.
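The abstract quotes its error figures as geometric mean average error (GMAE) without defining the metric on this page. A minimal formulation, assumed here for illustration only, is the geometric mean of the per-workload absolute relative errors between predicted and measured times:

```latex
% Assumed GMAE formulation (the metric is not defined on this page):
% geometric mean of absolute relative errors over n workloads, where
% T_i^pred and T_i^actual are the predicted and measured times for workload i.
\mathrm{GMAE}
  = \left( \prod_{i=1}^{n}
      \frac{\lvert T_i^{\mathrm{pred}} - T_i^{\mathrm{actual}} \rvert}
           {T_i^{\mathrm{actual}}}
    \right)^{1/n}
```

Under this reading, "less than 10% GMAE" means that, in the geometric-mean sense, predicted kernel runtimes deviate from measured runtimes by less than 10%.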
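The critical-path-based algorithm can be pictured as a longest-path traversal of the operator execution DAG, where each node costs its predicted kernel time plus its attributed overhead. The sketch below illustrates that idea; the networkx dependency, the node attribute names (kernel_time, overhead), and the example operator names are illustrative assumptions, not the authors' implementation.

```python
import networkx as nx

def predict_batch_time(graph: nx.DiGraph) -> float:
    """Predict per-batch training time as the critical (longest) path
    through the execution graph. Each node's weight is its predicted
    kernel time plus its attributed overhead (hypothetical attributes)."""
    finish = {}  # earliest finish time of each operator node
    for node in nx.topological_sort(graph):
        # An operator can start once all of its predecessors have finished.
        start = max((finish[p] for p in graph.predecessors(node)), default=0.0)
        weight = graph.nodes[node]["kernel_time"] + graph.nodes[node]["overhead"]
        finish[node] = start + weight
    return max(finish.values())

if __name__ == "__main__":
    # Toy execution graph with made-up per-operator times (ms).
    g = nx.DiGraph()
    g.add_node("embedding_fwd", kernel_time=1.2, overhead=0.1)
    g.add_node("mlp_fwd", kernel_time=0.8, overhead=0.05)
    g.add_node("interaction", kernel_time=0.3, overhead=0.05)
    g.add_edge("embedding_fwd", "interaction")
    g.add_edge("mlp_fwd", "interaction")
    print(predict_batch_time(g))  # -> 1.65
```

In this toy example the embedding branch dominates, so the predicted batch time is the embedding path's cost plus the interaction operator's cost, mirroring how the paper's traversal attributes the batch time to whichever chain of kernels and overheads is longest.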