Efficient, Scalable, and Sustainable DNN Training on SoC-Clustered Edge Servers

IF 7.7 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Mobile Computing Pub Date : 2024-08-13 DOI:10.1109/TMC.2024.3442430

Mengwei Xu;Daliang Xu;Chiheng Lou;Li Zhang;Gang Huang;Xin Jin;Xuanzhe Liu

Volume 23, issue 12, pages 14344-14360. Full text: https://ieeexplore.ieee.org/document/10634823/

Citations: 0

Abstract



In industrial edge computing, a novel server architecture known as SoC-Cluster, which aggregates numerous mobile systems-on-chips (SoCs), has emerged as a promising solution owing to its enhanced energy efficiency and seamless integration with prevalent mobile applications. Despite these advantages, the utilization of SoC-Cluster servers remains unsatisfactory, primarily because of the tidal patterns of user-initiated workloads. To address this inefficiency, we introduce SoCFlow+, a framework that co-locates deep learning training tasks on SoC-Cluster servers to improve resource utilization. SoCFlow+ incorporates three novel techniques tailored to mitigate the inherent limitations of commercial SoC-Cluster servers. First, it employs group-wise parallelism with delayed aggregation, which improves the training efficiency and scalability of deep learning models by circumventing network bottlenecks. Second, it integrates a data-parallel mixed-precision training algorithm that fully exploits the heterogeneous processors of mobile SoCs. Third, it employs an underclocking-aware workload re-balancing mechanism to counter the training performance degradation caused by the thermal control of mobile SoCs. In experiments, SoCFlow+ achieves a convergence speedup of 1.6× to 740× across 32 SoCs compared to conventional benchmarks. Furthermore, compared with commodity GPU servers (e.g., NVIDIA V100) under identical power constraints, SoCFlow+ delivers comparable training speed while reducing energy consumption by a factor of 2.31× to 10.23×, all while preserving convergence accuracy.
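The abstract does not say how the underclocking-aware re-balancing is implemented. As a rough illustration of the general idea only, the sketch below splits a global batch across workers in proportion to their currently measured throughput, so that a thermally throttled SoC receives fewer samples per step. The function name `rebalance_batch`, its parameters, and the proportional-split policy are assumptions made for this example and are not taken from SoCFlow+.

```python
def rebalance_batch(global_batch: int, throughputs: list[float]) -> list[int]:
    """Split `global_batch` samples across workers in proportion to throughput.

    Hypothetical helper, not the SoCFlow+ algorithm: `throughputs[i]` is an
    assumed measurement (e.g. samples/second) for worker i, which drops when
    the SoC is underclocked by thermal control.
    """
    if global_batch < 0:
        raise ValueError("global_batch must be non-negative")
    if not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("throughputs must be a non-empty list of positive values")

    total = sum(throughputs)
    # Ideal (fractional) share of the batch for each worker.
    ideal = [global_batch * t / total for t in throughputs]
    # Start from the floor of each share.
    shares = [int(x) for x in ideal]
    leftover = global_batch - sum(shares)

    # Largest-remainder rule: give the leftover samples, one each, to the
    # workers whose ideal share had the largest fractional part.
    by_remainder = sorted(
        range(len(ideal)), key=lambda i: ideal[i] - shares[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares
```

For instance, with a global batch of 128 and four workers of which the last runs at half speed, `rebalance_batch(128, [1.0, 1.0, 1.0, 0.5])` returns `[37, 37, 36, 18]`: the shares still sum to 128, and the throttled worker gets roughly half the work of the others, so all workers finish a step at about the same time.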


Source journal

IEEE Transactions on Mobile Computing (Engineering & Technology: Telecommunications)

CiteScore: 12.90

Self-citation rate: 2.50%

Articles published: 403

Review time: 6.6 months

About the journal: IEEE Transactions on Mobile Computing addresses key technical issues related to various aspects of mobile computing. This includes (a) architectures, (b) support services, (c) algorithm/protocol design and analysis, (d) mobile environments, (e) mobile communication systems, (f) applications, and (g) emerging technologies. Topics of interest include mobile networks and hosts, mobility management, multimedia, operating system support, power management, online and mobile environments, security, scalability, reliability, and emerging technologies such as wearable computers, body area networks, and wireless sensor networks. The journal serves as a comprehensive platform for advancements in mobile computing research.