Modeling Interprocessor Communication and Performance Scalability for Distributed Deep Learning Systems

Yi-Hong Lyu, C. Liu, Chen-Pang Lee, Chia-Heng Tu, Shih-Hao Hung
{"title":"Modeling Interprocessor Communication and Performance Scalability for Distributed Deep Learning Systems","authors":"Yi-Hong Lyu, C. Liu, Chen-Pang Lee, Chia-Heng Tu, Shih-Hao Hung","doi":"10.1109/HPCS48598.2019.9188168","DOIUrl":null,"url":null,"abstract":"While deep learning applications become popular, the design of deep learning systems is a critical task to unleash the computing power of underlying systems. Aside from the computing hardware, the computer networking is also a key factor that affects the delivered performance. When considering a large and complex model, the scalability of the system highly depends on the design of the networks, as well as the software behaviors. In this paper, we propose a profile-data-guided performance prediction method to estimate the performance of the system with desired high-speed interconnects, based on the profiling data obtained in a previous run. In particular, we leverage the open-source profiling tool, SOFA, for characterizing the software activities of deep learning software running in a computer cluster, and the characterized information is used to build the performance model for the model training process. When estimating the performance, SOFA is used to capture the performance critical factors for the model to make the predictions. To evaluate the proposed method, four popular deep learning models are adopted in our experiments, ResNet50, Inception3, Alexnet, and VGG16, where a computer cluster formed by four nodes is used to profile the training of the above models on TensorFlow. We ran the scalability analysis to analyze the size of the cluster, and the suitable computer networks for the models. By comparing the predicted data and those measured on the cluster, our model achieves up to 95% accuracy in most of the cases, with the maximum error rate of 10%.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 International Conference on High Performance Computing & Simulation (HPCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCS48598.2019.9188168","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

As deep learning applications become popular, the design of deep learning systems is a critical task for unleashing the computing power of the underlying hardware. Aside from the computing hardware, computer networking is also a key factor that affects the delivered performance. For a large and complex model, the scalability of the system depends heavily on the design of the network as well as on software behavior. In this paper, we propose a profile-data-guided performance prediction method that estimates the performance of a system equipped with desired high-speed interconnects, based on profiling data obtained from a previous run. In particular, we leverage the open-source profiling tool SOFA to characterize the software activities of deep learning software running in a computer cluster, and the characterized information is used to build a performance model of the model training process. When estimating performance, SOFA captures the performance-critical factors that the model uses to make its predictions. To evaluate the proposed method, four popular deep learning models are adopted in our experiments, ResNet50, Inception3, AlexNet, and VGG16, and a computer cluster of four nodes is used to profile the training of these models on TensorFlow. We ran a scalability analysis to determine suitable cluster sizes and computer networks for the models. Comparing the predictions against measurements on the cluster, our model achieves up to 95% accuracy in most cases, with a maximum error rate of 10%.
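The paper derives its performance model from SOFA profiling traces, which are not reproduced here. As a rough illustration of the underlying idea of projecting a measured run onto a different interconnect and cluster size, the sketch below uses a textbook alpha-beta (latency-bandwidth) cost model for ring all-reduce; all class names, functions, and numbers are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the paper's model is built from SOFA traces.
# This toy alpha-beta model shows the general idea of predicting per-step
# training time on a hypothetical interconnect from quantities measured
# in a previous run. All names and numbers below are assumptions.

from dataclasses import dataclass

@dataclass
class ProfiledRun:
    compute_time_s: float   # measured per-step compute time (from profile)
    gradient_bytes: float   # gradient volume exchanged per step

@dataclass
class Interconnect:
    latency_s: float        # alpha: per-message latency
    bandwidth_Bps: float    # sustained bandwidth in bytes/s

def ring_allreduce_time(bytes_total: float, n_nodes: int, net: Interconnect) -> float:
    """Classic ring all-reduce cost: 2*(n-1)/n of the data crosses each link,
    over 2*(n-1) latency-bound steps (reduce-scatter + all-gather)."""
    if n_nodes < 2:
        return 0.0
    traffic = 2.0 * (n_nodes - 1) / n_nodes * bytes_total
    return 2 * (n_nodes - 1) * net.latency_s + traffic / net.bandwidth_Bps

def predicted_step_time(run: ProfiledRun, n_nodes: int, net: Interconnect) -> float:
    """Assumes no compute/communication overlap (a pessimistic upper bound)."""
    return run.compute_time_s + ring_allreduce_time(run.gradient_bytes, n_nodes, net)

# Example: a VGG16-like gradient volume (~528 MB of fp32 gradients, assumed)
# projected onto 10 GbE versus a 100 Gb/s link.
vgg16 = ProfiledRun(compute_time_s=0.25, gradient_bytes=528e6)
for name, net in [("10GbE ", Interconnect(50e-6, 1.25e9)),
                  ("100Gbs", Interconnect(5e-6, 12.5e9))]:
    for n in (1, 2, 4):
        t = predicted_step_time(vgg16, n, net)
        print(f"{name}: {n} nodes -> {t * 1e3:.1f} ms/step")
```

Under this kind of model, communication-heavy networks such as VGG16 scale poorly on slow interconnects while compact models such as ResNet50 scale better, which matches the kind of interconnect-sensitivity analysis the paper performs with its profile-derived model.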