No Worker Left (Too Far) Behind: Dynamic Hybrid Synchronization for In‐Network ML Aggregation

IF 1.5, Zone 4 (Computer Science), JCR Q3: COMPUTER SCIENCE, INFORMATION SYSTEMS
Diego Cardoso Nunes, Bruno Loureiro Coelho, Ricardo Parizotto, Alberto Egon Schaeffer‐Filho
Journal: International Journal of Network Management
DOI: 10.1002/nem.2290 (https://doi.org/10.1002/nem.2290)
Publication date: 2024-07-25
Publication type: Journal Article
Citation count: 0

Abstract

Achieving high-performance aggregation is essential to scaling data-parallel distributed machine learning (ML) training. Recent research in in-network computing has shown that offloading aggregation to the network data plane can accelerate it compared to traditional server-only approaches, reducing the propagation delay and consequently speeding up distributed training. However, the existing literature on in-network aggregation does not provide ways to deal with slower workers (called stragglers), whose presence can negatively impact distributed training by increasing the time it takes to complete. In this paper, we present Serene, an in-network aggregation system capable of circumventing the effects of stragglers. Serene coordinates the ML workers to cooperate with a programmable switch using a hybrid synchronization approach in which the synchronization strategy can be changed dynamically through a control plane API that translates high-level code into switch rules. The Serene switch employs an efficient data structure for managing synchronization and a hot-swapping mechanism to consistently change from one synchronization strategy to another. We implemented and evaluated a prototype using BMv2 and a proof of concept on a Tofino ASIC. We ran experiments with realistic ML workloads, including a neural network trained for image classification. Our results show that Serene can speed up training by up to 40% in emulation scenarios by drastically reducing the cumulative waiting time compared to a synchronous baseline.
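The abstract does not spell out Serene's synchronization strategies or its switch data structure, so the following is only a minimal sketch of the general idea, under stated assumptions. It models an aggregation point that starts fully synchronous (waits for every worker) and can be told to release a result once a smaller quorum has reported, with the change applied only at a round boundary so that a single round never mixes two strategies. The class ToyAggregator, its methods, the quorum rule, and the policy of dropping late contributions are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional


class ToyAggregator:
    """Illustrative aggregation point; NOT Serene's actual switch logic."""

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers
        self.quorum = num_workers            # start fully synchronous
        self.round = 0                       # round currently being aggregated
        self._next_quorum: Optional[int] = None
        self._chunks: Dict[int, List[float]] = {}

    def request_quorum(self, quorum: int) -> None:
        """Queue a strategy change; it takes effect at the next round boundary."""
        if not 1 <= quorum <= self.num_workers:
            raise ValueError("quorum must be between 1 and num_workers")
        self._next_quorum = quorum

    def submit(self, worker_id: int, round_id: int,
               grad: List[float]) -> Optional[List[float]]:
        """Record one contribution; return the element-wise sum once the quorum is met."""
        if round_id != self.round:
            return None                      # assumption: late contributions are dropped
        self._chunks[worker_id] = grad
        if len(self._chunks) < self.quorum:
            return None                      # still waiting for more workers
        total = [sum(vals) for vals in zip(*self._chunks.values())]
        self._chunks.clear()
        self.round += 1
        # Apply a queued change only now, so one round never mixes strategies.
        if self._next_quorum is not None:
            self.quorum = self._next_quorum
            self._next_quorum = None
        return total


agg = ToyAggregator(num_workers=3)
agg.submit(0, 0, [1.0, 2.0])
agg.submit(1, 0, [3.0, 4.0])
agg.request_quorum(2)                        # requested mid-round, deferred
first = agg.submit(2, 0, [5.0, 6.0])         # round 0 still waits for all three
agg.submit(0, 1, [1.0, 1.0])
second = agg.submit(1, 1, [2.0, 2.0])        # round 1 completes without worker 2
```

In this toy model, lowering the quorum is what keeps a straggler from stalling the round; a real system would also have to decide what happens to the straggler's late contribution, which the sketch simply discards.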
Source journal
International Journal of Network Management (COMPUTER SCIENCE, INFORMATION SYSTEMS-TELECOMMUNICATIONS)
CiteScore: 5.10
Self-citation rate: 6.70%
Articles published: 25
Review time: >12 weeks
Journal description: Modern computer networks and communication systems are increasing in size, scope, and heterogeneity. The promise of a single end-to-end technology has not been realized and likely never will be. The decreasing cost of bandwidth is extending the possible applications of computer networks and communication systems to entirely new domains. Problems in integrating heterogeneous wired and wireless technologies, ensuring security and quality of service, and reliably operating large-scale systems, including cloud computing, have all emerged as important topics. The one constant is the need for network management. Challenges in network management have never been greater than they are today. The International Journal of Network Management is the forum for researchers, developers, and practitioners in network management to present their work to an international audience. The journal is dedicated to the dissemination of information that will enable improved management, operation, and maintenance of computer networks and communication systems. The journal is peer reviewed and publishes original papers (both theoretical and experimental) by leading researchers, practitioners, and consultants from universities, research laboratories, and companies around the world. Issues with thematic or guest-edited special topics typically occur several times per year. Topic areas for the journal are largely defined by the taxonomy for network and service management developed by IFIP WG6.6, together with IEEE-CNOM, the IRTF-NMRG, and the Emanics Network of Excellence.