基于阿里巴巴- pai的深度学习训练工作量表征

2019 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2019-10-14 DOI:10.1109/IISWC47752.2019.9042047

Mengdi Wang, Chen Meng, Guoping Long, Chuan Wu, Jun Yang, Wei Lin, Yangqing Jia

{"title":"基于阿里巴巴- pai的深度学习训练工作量表征","authors":"Mengdi Wang, Chen Meng, Guoping Long, Chuan Wu, Jun Yang, Wei Lin, Yangqing Jia","doi":"10.1109/IISWC47752.2019.9042047","DOIUrl":null,"url":null,"abstract":"Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds, is to characterize the computing and data transfer demands of these workloads, and more importantly, the training performance given the underlying software framework and hardware configurations. In this paper, we characterize deep learning training workloads from Platform of Artificial Intelligence (PAI) in Alibaba. We establish an analytical framework to investigate detailed execution time breakdown of various workloads using different training architectures, to identify performance bottleneck. Results show that weight/gradient communication during training takes almost 62% of the total execution time among all our workloads on average. The computation part, involving both GPU computing and memory access, are not the biggest bottleneck based on collective behavior of the workloads. We further evaluate attainable performance of the workloads on various potential software/hardware mappings, and explore implications on software architecture selection and hardware configurations. We identify that 60% of PS/Worker workloads can be potentially sped up when ported to the AllReduce architecture exploiting the high-speed NVLink for GPU interconnect, and on average 1.7X speedup can be achieved when Ethernet bandwidth is upgraded from 25 Gbps to 100 Gbps.","PeriodicalId":121068,"journal":{"name":"2019 IEEE International Symposium on Workload Characterization (IISWC)","volume":"32 2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"35","resultStr":"{\"title\":\"Characterizing Deep Learning Training Workloads on Alibaba-PAI\",\"authors\":\"Mengdi Wang, Chen Meng, Guoping Long, Chuan Wu, Jun Yang, Wei Lin, Yangqing Jia\",\"doi\":\"10.1109/IISWC47752.2019.9042047\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds, is to characterize the computing and data transfer demands of these workloads, and more importantly, the training performance given the underlying software framework and hardware configurations. In this paper, we characterize deep learning training workloads from Platform of Artificial Intelligence (PAI) in Alibaba. We establish an analytical framework to investigate detailed execution time breakdown of various workloads using different training architectures, to identify performance bottleneck. Results show that weight/gradient communication during training takes almost 62% of the total execution time among all our workloads on average. The computation part, involving both GPU computing and memory access, are not the biggest bottleneck based on collective behavior of the workloads. We further evaluate attainable performance of the workloads on various potential software/hardware mappings, and explore implications on software architecture selection and hardware configurations. We identify that 60% of PS/Worker workloads can be potentially sped up when ported to the AllReduce architecture exploiting the high-speed NVLink for GPU interconnect, and on average 1.7X speedup can be achieved when Ethernet bandwidth is upgraded from 25 Gbps to 100 Gbps.\",\"PeriodicalId\":121068,\"journal\":{\"name\":\"2019 IEEE International Symposium on Workload Characterization (IISWC)\",\"volume\":\"32 2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"35\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE International Symposium on Workload Characterization (IISWC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IISWC47752.2019.9042047\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Symposium on Workload Characterization (IISWC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IISWC47752.2019.9042047","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 35

摘要

现代深度学习模型已经在各个领域得到了应用，包括计算机视觉(CV)、自然语言处理(NLP)、搜索和推荐。在实际的人工智能集群中，训练这些模型的工作负载使用TensorFlow、Caffe、PyTorch和CNTK等软件框架运行。有效运行实际人工智能云的一个关键问题是表征这些工作负载的计算和数据传输需求，更重要的是，给定底层软件框架和硬件配置的训练性能。在本文中，我们描述了来自阿里巴巴人工智能平台(PAI)的深度学习训练工作负载。我们建立了一个分析框架来调查使用不同训练架构的各种工作负载的详细执行时间分解，以确定性能瓶颈。结果表明，在所有工作负载中，训练期间的权重/梯度通信平均花费了几乎62%的总执行时间。计算部分，包括GPU计算和内存访问，并不是基于工作负载集体行为的最大瓶颈。我们进一步评估了在各种潜在的软件/硬件映射上可实现的工作负载性能，并探讨了对软件架构选择和硬件配置的影响。我们发现，60%的PS/Worker工作负载在移植到利用高速NVLink进行GPU互连的AllReduce架构时可能会加速，当以太网带宽从25 Gbps升级到100 Gbps时，平均可以实现1.7倍的加速。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Characterizing Deep Learning Training Workloads on Alibaba-PAI

Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds, is to characterize the computing and data transfer demands of these workloads, and more importantly, the training performance given the underlying software framework and hardware configurations. In this paper, we characterize deep learning training workloads from Platform of Artificial Intelligence (PAI) in Alibaba. We establish an analytical framework to investigate detailed execution time breakdown of various workloads using different training architectures, to identify performance bottleneck. Results show that weight/gradient communication during training takes almost 62% of the total execution time among all our workloads on average. The computation part, involving both GPU computing and memory access, are not the biggest bottleneck based on collective behavior of the workloads. We further evaluate attainable performance of the workloads on various potential software/hardware mappings, and explore implications on software architecture selection and hardware configurations. We identify that 60% of PS/Worker workloads can be potentially sped up when ported to the AllReduce architecture exploiting the high-speed NVLink for GPU interconnect, and on average 1.7X speedup can be achieved when Ethernet bandwidth is upgraded from 25 Gbps to 100 Gbps.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2019 IEEE International Symposium on Workload Characterization (IISWC)

自引率

0.00%

发文量