{"title":"边缘设备网络中基于数据和模型并行的分布式深度学习系统","authors":"Tanmoy Sen, Haiying Shen","doi":"10.1109/ICCCN58024.2023.10230190","DOIUrl":null,"url":null,"abstract":"With the emergence of edge computing along with its local computation advantage over the cloud, methods for distributed deep learning (DL) training on edge nodes have been proposed. The increasing scale of DL models and large training dataset poses a challenge to run such jobs in one edge node due to resource constraints. However, the proposed methods either run the entire model in one edge node, collect all training data into one edge node, or still involve the remote cloud. To handle the challenge, we propose a fully distributed training system that realizes both Data and Model Parallelism over a network of edge devices (called DMP). It clusters the edge nodes to build a training structure by taking advantage of the feature that distributed edge nodes sense data for training. For each cluster, we propose a heuristic and a Reinforcement Learning (RL) based algorithm to handle the problem of how to partition a DL model and assign the partitions to edge nodes for model parallelism to minimize the overall training time. Taking advantage of the feature that geographically close edge nodes sense similar data, we further propose two schemes to avoid transferring duplicated data to the first-layer edge node as training data without compromising accuracy. Our container-based emulation and real edge node experiments show that our systems reduce up to 44% training time while maintaining the accuracy comparing with the state-of-the-art approaches. We also open sourced our source code.","PeriodicalId":132030,"journal":{"name":"2023 32nd International Conference on Computer Communications and Networks (ICCCN)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"A Data and Model Parallelism based Distributed Deep Learning System in a Network of Edge Devices\",\"authors\":\"Tanmoy Sen, Haiying Shen\",\"doi\":\"10.1109/ICCCN58024.2023.10230190\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the emergence of edge computing along with its local computation advantage over the cloud, methods for distributed deep learning (DL) training on edge nodes have been proposed. The increasing scale of DL models and large training dataset poses a challenge to run such jobs in one edge node due to resource constraints. However, the proposed methods either run the entire model in one edge node, collect all training data into one edge node, or still involve the remote cloud. To handle the challenge, we propose a fully distributed training system that realizes both Data and Model Parallelism over a network of edge devices (called DMP). It clusters the edge nodes to build a training structure by taking advantage of the feature that distributed edge nodes sense data for training. For each cluster, we propose a heuristic and a Reinforcement Learning (RL) based algorithm to handle the problem of how to partition a DL model and assign the partitions to edge nodes for model parallelism to minimize the overall training time. Taking advantage of the feature that geographically close edge nodes sense similar data, we further propose two schemes to avoid transferring duplicated data to the first-layer edge node as training data without compromising accuracy. 
Our container-based emulation and real edge node experiments show that our systems reduce up to 44% training time while maintaining the accuracy comparing with the state-of-the-art approaches. We also open sourced our source code.\",\"PeriodicalId\":132030,\"journal\":{\"name\":\"2023 32nd International Conference on Computer Communications and Networks (ICCCN)\",\"volume\":\"115 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 32nd International Conference on Computer Communications and Networks (ICCCN)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCCN58024.2023.10230190\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 32nd International Conference on Computer Communications and Networks (ICCCN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCCN58024.2023.10230190","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Data and Model Parallelism based Distributed Deep Learning System in a Network of Edge Devices
With the emergence of edge computing and its advantage of local computation over the cloud, methods for distributed deep learning (DL) training on edge nodes have been proposed. The growing scale of DL models and the size of training datasets make it challenging to run such jobs on a single edge node due to resource constraints. However, existing methods either run the entire model on one edge node, collect all training data into one edge node, or still involve the remote cloud. To address this challenge, we propose a fully distributed training system that realizes both Data and Model Parallelism over a network of edge devices (called DMP). It clusters the edge nodes into a training structure, taking advantage of the fact that distributed edge nodes sense the data used for training. For each cluster, we propose a heuristic algorithm and a Reinforcement Learning (RL) based algorithm that decide how to partition a DL model and assign the partitions to edge nodes for model parallelism, so as to minimize the overall training time. Exploiting the observation that geographically close edge nodes sense similar data, we further propose two schemes that avoid transferring duplicated data to the first-layer edge node as training data, without compromising accuracy. Our container-based emulation and real edge node experiments show that our system reduces training time by up to 44% while maintaining accuracy, compared with state-of-the-art approaches. We have also open-sourced our code.
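
The abstract does not detail the heuristic partitioning algorithm, so the following is a minimal sketch under stated assumptions, not the paper's actual method: a greedy rule cuts a chain of layers into consecutive partitions balanced by compute cost, and the per-batch time is estimated as the bottleneck stage of the pipelined model-parallel chain. All names here (Layer, Node, partition_layers, pipeline_time_per_batch) and the cost model are hypothetical illustrations.

    # Illustrative sketch only: the paper's heuristic is not given in the
    # abstract. This assumes a chain-structured model with at least as many
    # layers as nodes, and a cost model of per-node compute speed plus the
    # link bandwidth used to forward activations to the next partition.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Layer:
        name: str
        flops: float             # compute cost of the layer
        activation_bytes: float  # output size sent to the next partition

    @dataclass
    class Node:
        name: str
        flops_per_sec: float     # compute capability of the edge node
        bandwidth_bps: float     # link bandwidth to the next node in the chain

    def partition_layers(layers: List[Layer], nodes: List[Node]) -> List[List[Layer]]:
        """Greedily cut the layer chain into len(nodes) consecutive partitions,
        placing a cut once the current node has reached its equal share of the
        total compute cost."""
        total_flops = sum(l.flops for l in layers)
        share = total_flops / len(nodes)
        partitions, current, used = [], [], 0.0
        for layer in layers:
            current.append(layer)
            used += layer.flops
            # cut here if this node has its share and later nodes remain
            if used >= share and len(partitions) < len(nodes) - 1:
                partitions.append(current)
                current, used = [], 0.0
        partitions.append(current)
        return partitions

    def pipeline_time_per_batch(partitions: List[List[Layer]], nodes: List[Node]) -> float:
        """Estimated per-batch time: the max over stages of compute plus
        activation-transfer time, i.e. the pipeline bottleneck."""
        times = []
        for part, node in zip(partitions, nodes):
            compute = sum(l.flops for l in part) / node.flops_per_sec
            transfer = part[-1].activation_bytes / node.bandwidth_bps
            times.append(compute + transfer)
        return max(times)

    # Example: 6 layers across 3 edge nodes -> partitions of [2, 2, 2]
    layers = [Layer(f"l{i}", flops=1e9, activation_bytes=4e6) for i in range(6)]
    nodes = [Node(f"n{i}", flops_per_sec=5e9, bandwidth_bps=1e8) for i in range(3)]
    parts = partition_layers(layers, nodes)
    print([len(p) for p in parts], pipeline_time_per_batch(parts, nodes))

The paper's RL-based algorithm would instead learn a partition/assignment policy against measured training time; the greedy cut above stands in only for the general shape of the optimization problem.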
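Likewise, the abstract does not say how the two schemes detect duplicated data from geographically close nodes. The sketch below assumes a simple cosine-similarity threshold over per-sample feature vectors, with select_non_duplicates as a hypothetical helper, to illustrate filtering near-duplicate sensed samples before they are forwarded to the first-layer node.

    # Hypothetical sketch of duplicate filtering before upload: the actual
    # detection schemes are not described in the abstract. Here a sample is
    # dropped if it is nearly collinear with an already-kept sample.
    import numpy as np

    def select_non_duplicates(samples: np.ndarray, threshold: float = 0.95) -> list:
        """Keep a sample only if its cosine similarity to every already-kept
        sample is below `threshold`; kept indices are forwarded for training."""
        kept = []
        for i, x in enumerate(samples):
            x_norm = x / (np.linalg.norm(x) + 1e-12)
            is_dup = False
            for j in kept:
                y = samples[j]
                y_norm = y / (np.linalg.norm(y) + 1e-12)
                if float(x_norm @ y_norm) >= threshold:
                    is_dup = True
                    break
            if not is_dup:
                kept.append(i)
        return kept

    # Example: 100 sensed samples as 32-dim feature vectors
    rng = np.random.default_rng(0)
    data = rng.normal(size=(100, 32))
    data[1] = data[0] + 0.001  # a near-duplicate reading from a nearby node
    indices = select_non_duplicates(data)
    assert 1 not in indices    # the near-duplicate is filtered out

The "without compromising accuracy" claim in the abstract suggests the real schemes are more careful than a fixed threshold; this example only conveys the idea of suppressing redundant transfers at the source.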