Scaling Single-Image Super-Resolution Training on Modern HPC Clusters: Early Experiences

Quentin G. Anthony, Lang Xu, H. Subramoni, D. Panda

2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), June 2021. DOI: 10.1109/IPDPSW52791.2021.00143
Deep Learning (DL) models for super-resolution (DLSR) are an emerging trend in response to the growth of ML/DL applications requiring high-resolution images. DLSR methods have also shown promise in domains such as medical imaging, surveillance, and microscopy. However, DLSR models are extremely computationally demanding and require unreasonably long training times on modern Volta GPUs. In our experiments, we observed only 10.3 images/second on a single Volta GPU when training EDSR, a state-of-the-art DLSR model for single-image super-resolution. By comparison, a Volta GPU can process 360 images/second while training ResNet-50, a state-of-the-art model for image classification. We therefore believe supercomputers are a good candidate for speeding up DLSR model training. In this paper, we select EDSR as the representative DLSR PyTorch model and introduce Horovod-based distributed EDSR training. However, we observed poor default EDSR scaling performance on the Lassen HPC system at Lawrence Livermore National Laboratory. To investigate this performance degradation, we perform exhaustive communication profiling. The profiling insights are then used to optimize CUDA-Aware MPI for DLSR models by ensuring that advanced MPI designs involving CUDA IPC and registration caching are properly applied by DL frameworks. We present a comprehensive scaling study of EDSR with MVAPICH2-GDR and NCCL on up to 512 GPUs on Lassen, and demonstrate a 15.6% improvement in scaling efficiency over default Horovod training, which translates to a 1.26× speedup in training performance.
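For context, the sketch below shows what Horovod-based data-parallel training of a PyTorch super-resolution model looks like. It is a minimal illustration, not the authors' code: `SRModel` is a hypothetical stand-in for EDSR, and the random data, loss, and hyperparameters are assumptions chosen only to make the example self-contained.

```python
# Minimal sketch of Horovod data-parallel training in PyTorch, in the spirit
# of the paper's distributed EDSR setup. SRModel is a hypothetical stand-in
# for EDSR; data, loss, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                               # one process per GPU
torch.cuda.set_device(hvd.local_rank())  # pin each process to its local GPU

class SRModel(nn.Module):
    """Hypothetical stand-in for EDSR: a shallow convolutional SR network."""
    def __init__(self, scale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),      # upsample to the target resolution
        )
    def forward(self, x):
        return self.body(x)

model = SRModel().cuda()
criterion = nn.L1Loss()
# Standard Horovod recipe: scale the learning rate by the number of ranks.
base_optimizer = torch.optim.Adam(model.parameters(), lr=1e-4 * hvd.size())

# Wrap the optimizer so gradients are allreduced across ranks each step; the
# allreduce backend (MPI or NCCL) is selected when Horovod is built.
optimizer = hvd.DistributedOptimizer(
    base_optimizer, named_parameters=model.named_parameters())

# Start all ranks from identical model parameters and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    # Random tensors stand in for a low/high-resolution image pair.
    lr_batch = torch.randn(16, 3, 48, 48, device="cuda")
    hr_batch = torch.randn(16, 3, 96, 96, device="cuda")
    optimizer.zero_grad()
    loss = criterion(model(lr_batch), hr_batch)
    loss.backward()                      # triggers Horovod's gradient allreduce
    optimizer.step()
    if hvd.rank() == 0 and step % 10 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```

A script like this is typically launched with one process per GPU, e.g. `horovodrun -np 512 python train.py` or an equivalent `mpirun` invocation. The gradient allreduce then runs over whichever MPI library (e.g. MVAPICH2-GDR) or NCCL build Horovod was compiled against, which is the layer where the CUDA IPC and registration-caching optimizations studied in the paper take effect.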