{"title":"分布式深度学习应用的性能和一致性分析","authors":"Danlin Jia, M. Saha, J. Bhimani, N. Mi","doi":"10.1109/IPCCC50635.2020.9391566","DOIUrl":null,"url":null,"abstract":"Accelerating the training of Deep Neural Network (DNN) models is very important for successfully using deep learning techniques in fields like computer vision and speech recognition. Distributed frameworks help to speed up the training process for large DNN models and datasets. Plenty of works have been done to improve model accuracy and training efficiency, based on mathematical analysis of computations in the Con-volutional Neural Networks (CNN). However, to run distributed deep learning applications in the real world, users and developers need to consider the impacts of system resource distribution. In this work, we deploy a real distributed deep learning cluster with multiple virtual machines. We conduct an in-depth analysis to understand the impacts of system configurations, distribution typologies, and application parameters, on the latency and correctness of the distributed deep learning applications. We analyze the performance diversity under different model consistency and data parallelism by profiling run-time system utilization and tracking application activities. Based on our observations and analysis, we develop design guidelines for accelerating distributed deep-learning training on virtualized environments.","PeriodicalId":226034,"journal":{"name":"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance and Consistency Analysis for Distributed Deep Learning Applications\",\"authors\":\"Danlin Jia, M. Saha, J. Bhimani, N. Mi\",\"doi\":\"10.1109/IPCCC50635.2020.9391566\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Accelerating the training of Deep Neural Network (DNN) models is very important for successfully using deep learning techniques in fields like computer vision and speech recognition. Distributed frameworks help to speed up the training process for large DNN models and datasets. Plenty of works have been done to improve model accuracy and training efficiency, based on mathematical analysis of computations in the Con-volutional Neural Networks (CNN). However, to run distributed deep learning applications in the real world, users and developers need to consider the impacts of system resource distribution. In this work, we deploy a real distributed deep learning cluster with multiple virtual machines. We conduct an in-depth analysis to understand the impacts of system configurations, distribution typologies, and application parameters, on the latency and correctness of the distributed deep learning applications. We analyze the performance diversity under different model consistency and data parallelism by profiling run-time system utilization and tracking application activities. 
Based on our observations and analysis, we develop design guidelines for accelerating distributed deep-learning training on virtualized environments.\",\"PeriodicalId\":226034,\"journal\":{\"name\":\"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)\",\"volume\":\"65 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPCCC50635.2020.9391566\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPCCC50635.2020.9391566","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Performance and Consistency Analysis for Distributed Deep Learning Applications
Accelerating the training of Deep Neural Network (DNN) models is critical to successfully applying deep learning in fields such as computer vision and speech recognition. Distributed frameworks help speed up training for large DNN models and datasets. A great deal of work has improved model accuracy and training efficiency based on mathematical analysis of the computations in Convolutional Neural Networks (CNNs). However, to run distributed deep learning applications in the real world, users and developers also need to consider how system resources are distributed. In this work, we deploy a real distributed deep learning cluster with multiple virtual machines. We conduct an in-depth analysis of how system configurations, distribution topologies, and application parameters affect the latency and correctness of distributed deep learning applications. We analyze performance variation under different model consistency and data parallelism settings by profiling run-time system utilization and tracking application activities. Based on our observations and analysis, we develop design guidelines for accelerating distributed deep learning training in virtualized environments.
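The abstract does not specify which distributed framework or consistency scheme the experiments use. Purely as a point of reference, the sketch below shows one common realization of the data parallelism it mentions: synchronous data-parallel training with PyTorch's DistributedDataParallel, where every worker holds a model replica and gradients are all-reduced each step. The framework choice, model, and data here are illustrative assumptions, not the paper's setup.

# --- Illustrative sketch (not from the paper): synchronous data-parallel training ---
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per worker (e.g., per VM); "torchrun" sets RANK, WORLD_SIZE,
    # MASTER_ADDR, and MASTER_PORT for the process group.
    dist.init_process_group(backend="gloo")  # use "nccl" when each worker has a GPU
    rank = dist.get_rank()

    model = nn.Linear(32, 10)                # stand-in for a CNN
    ddp_model = DDP(model)                   # replicates the model on every worker
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(20):
        inputs = torch.randn(16, 32)         # each rank trains on its own data shard
        labels = torch.randint(0, 10, (16,))
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), labels)
        loss.backward()                      # gradients all-reduced here: synchronous model consistency
        optimizer.step()
        if rank == 0 and step % 5 == 0:
            print(f"step {step}, loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
# --- end of sketch ---

Such a script would be launched with, for example, torchrun --nproc_per_node=2 ddp_sketch.py. An asynchronous setup (e.g., a parameter server) would relax the per-step synchronization that the all-reduce enforces, which is the kind of model-consistency trade-off the abstract refers to.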