Traffic-Aware In-Network Aggregation Placement for Multi-Tenant Distributed Machine Learning
H. Kim, Hochan Lee, Sangheon Pack
2023 32nd International Conference on Computer Communications and Networks (ICCCN), July 2023
DOI: 10.1109/ICCCN58024.2023.10230140
Citations: 0
Abstract
Distributed machine learning is an effective method to alleviate the intensive computation costs of training; however, it suffers from network bottlenecks while gathering local results. The recent advent of programmable data planes has opened a new avenue, in-network aggregation, which executes gradient aggregation in the middle of the network, resolving network bottlenecks and further accelerating distributed machine learning. However, due to the resource constraints of current programmable data planes, installing in-network aggregation functionality throughout the network would impose an unacceptable burden, calling for sophisticated deployment. In this paper, we consider the problem of deploying in-network aggregation functionality so as to minimize the total network traffic in multi-tenant distributed machine learning. Since the formulated problem is an integer linear programming problem, which is known to be NP-hard, we propose a traffic-aware placement of in-network aggregation (TAPINA) algorithm with lower complexity and near-optimal performance. TAPINA decides the aggregation points of multiple tenants sequentially, in order of their expected traffic, and reuses aggregation points already selected by other tenants to reduce the overall deployment cost. Simulation results demonstrate that TAPINA achieves near-optimal performance, reducing traffic by up to 20% compared to the state-of-the-art algorithm in most cases.
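The greedy idea described in the abstract — process tenants in descending order of expected traffic, and favor switches that already host aggregation so their deployment cost is paid only once — can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the tenant/switch data structures, the `traffic_fn` cost model, and the flat `deploy_cost` are all assumptions introduced here for clarity.

```python
def place_aggregation(tenants, candidate_switches, traffic_fn, deploy_cost):
    """Greedy, traffic-ordered placement sketch (assumed interfaces).

    tenants: list of (tenant_id, expected_traffic) pairs.
    candidate_switches: iterable of switch identifiers.
    traffic_fn(tenant_id, switch): resulting traffic if the tenant
        aggregates at that switch (hypothetical cost model).
    deploy_cost: one-time cost of enabling aggregation on a new switch.
    Returns a dict mapping tenant_id -> chosen aggregation switch.
    """
    placement = {}
    used = set()  # switches already hosting aggregation; reusable at no extra cost
    # Tenants with higher expected traffic are placed first, as in TAPINA.
    for tenant_id, _ in sorted(tenants, key=lambda t: -t[1]):
        best = min(
            candidate_switches,
            # Reusing an already-enabled switch avoids the deployment cost.
            key=lambda s: traffic_fn(tenant_id, s)
            + (0 if s in used else deploy_cost),
        )
        placement[tenant_id] = best
        used.add(best)
    return placement
```

In this toy setting, a later tenant may accept a slightly higher traffic cost at an already-enabled switch rather than pay the deployment cost at a fresh one, which is the reuse behavior the abstract attributes to TAPINA.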