Building a Performance Model for Deep Learning Recommendation Model Training on GPUs
Zhongyi Lin, Louis Feng, Ehsan K. Ardestani, Jaewon Lee, John Lundell, Changkyu Kim, Arun Kejariwal, John D. Owens
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), May 2022
DOI: 10.1109/ISPASS55109.2022.00030 (https://doi.org/10.1109/ISPASS55109.2022.00030)
Citations: 2
Abstract
We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), which have low GPU utilization (i.e., the percentage of per-batch training time when kernels are running on the device) compared to other well-optimized computer vision (CV) and natural language processing (NLP) models. We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time, and that they can be tackled separately by (1) flexibly adopting heuristic- and ML-based kernel performance models for the kernels that dominate the device active time, and (2) categorizing operator overheads into five types to quantitatively determine their contribution to the overall device time. Combining these two parts, we propose a critical-path-based algorithm that predicts the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean absolute error (GMAE) in all kernel performance modeling, and 5.23% and 7.96% geomean errors, respectively, for GPU active time and overall end-to-end per-batch training time prediction on the highly customized and multi-factor-dominated DLRM architectures. We also demonstrate our performance model's ability to generalize to other compute-bound DL models targeted by most previous methods, and to better assist general model-system co-design than previous work.
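To make the critical-path idea concrete, here is a minimal sketch of how a per-batch time prediction could traverse an execution graph, accounting for both kernel runtimes (device active time) and operator overheads (a source of device idle time). This is an illustrative assumption, not the paper's actual implementation: the `Op` dataclass, its field names, and the single-stream, linearized-graph model are all hypothetical simplifications.

```python
# A minimal sketch (assumed, not the paper's code) of critical-path-style
# per-batch time prediction over a linearized execution graph.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Op:
    name: str
    host_overhead_us: float  # CPU-side framework/launch overhead before kernels issue
    kernel_times_us: List[float] = field(default_factory=list)  # predicted kernel runtimes

def predict_batch_time_us(ops: List[Op]) -> float:
    """Walk the operators in execution order, tracking when the host can
    issue the next launch and when the device becomes free."""
    host_t = 0.0    # time at which the host finishes issuing work so far
    device_t = 0.0  # time at which the device finishes all queued kernels
    for op in ops:
        host_t += op.host_overhead_us      # overhead delays kernel launches
        for k in op.kernel_times_us:
            start = max(host_t, device_t)  # kernel waits for its launch AND a free device
            device_t = start + k           # device stays busy until the kernel ends
    return max(host_t, device_t)           # the batch ends when both host and device drain

# Hypothetical example: one overhead-dominated op, one kernel-dominated op.
ops = [
    Op("embedding_lookup", host_overhead_us=20.0, kernel_times_us=[15.0]),
    Op("fc_forward", host_overhead_us=5.0, kernel_times_us=[120.0, 40.0]),
]
print(f"predicted per-batch time: {predict_batch_time_us(ops):.1f} us")
```

Under this model, whichever of the two running clocks finishes last determines the end-to-end batch time: when overheads dominate, the device sits idle waiting for launches (the low-utilization regime the abstract describes for DLRM); when kernels dominate, overheads hide behind device work, which is why compute-bound CV/NLP models are comparatively easy cases for such a predictor.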