Traffic-Aware In-Network Aggregation Placement for Multi-Tenant Distributed Machine Learning
H. Kim, Hochan Lee, Sangheon Pack
2023 32nd International Conference on Computer Communications and Networks (ICCCN), July 2023
DOI: 10.1109/ICCCN58024.2023.10230140
Citations: 0
Abstract
Distributed machine learning is an effective method to alleviate the intensive computation costs of training; however, it suffers from network bottlenecks while gathering local results. The recent advent of programmable data planes has opened a new avenue, in-network aggregation, which executes gradient aggregation in the middle of the network, resolving network bottlenecks and further accelerating distributed machine learning. However, due to the resource constraints of current programmable data planes, installing in-network aggregation functionality throughout the network would impose an unacceptable burden, calling for sophisticated deployment. In this paper, we consider the problem of deploying in-network aggregation functionality so as to minimize the total network traffic in multi-tenant distributed machine learning. Since the formulated problem is an integer linear programming problem, which is known to be NP-hard, we propose a traffic-aware placement of in-network aggregation (TAPINA) algorithm with lower complexity and near-optimal performance. TAPINA decides the aggregation points of multiple tenants sequentially, in order of their expected traffic, and reuses aggregation points already selected by other tenants to reduce the overall deployment cost. Simulation results demonstrate that TAPINA achieves near-optimal performance, reducing traffic by up to 20% compared to the state-of-the-art algorithm in most cases.
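The greedy idea described in the abstract — process tenants in descending order of expected traffic, and favor switches that already host aggregation so their deployment cost is paid only once — can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the tenant/switch data structures, the `traffic_fn` cost model, and the flat `deploy_cost` are all assumptions introduced here for clarity.

```python
def place_aggregation(tenants, candidate_switches, traffic_fn, deploy_cost):
    """Greedy, traffic-ordered placement sketch (assumed interfaces).

    tenants: list of (tenant_id, expected_traffic) pairs.
    candidate_switches: iterable of switch identifiers.
    traffic_fn(tenant_id, switch): resulting traffic if the tenant
        aggregates at that switch (hypothetical cost model).
    deploy_cost: one-time cost of enabling aggregation on a new switch.
    Returns a dict mapping tenant_id -> chosen aggregation switch.
    """
    placement = {}
    used = set()  # switches already hosting aggregation; reusable at no extra cost
    # Tenants with higher expected traffic are placed first, as in TAPINA.
    for tenant_id, _ in sorted(tenants, key=lambda t: -t[1]):
        best = min(
            candidate_switches,
            # Reusing an already-enabled switch avoids the deployment cost.
            key=lambda s: traffic_fn(tenant_id, s)
            + (0 if s in used else deploy_cost),
        )
        placement[tenant_id] = best
        used.add(best)
    return placement
```

In this toy setting, a later tenant may accept a slightly higher traffic cost at an already-enabled switch rather than pay the deployment cost at a fresh one, which is the reuse behavior the abstract attributes to TAPINA.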