In-network aggregation enabled multiple sub-blocks parallel repair in erasure-coded storage system
Lei Liu, Yong Wang, Yangfan Liang, Junqi Chen, Qian He
Journal: Computer Networks, vol. 270, Article 111523 (published 2025-07-16)
DOI: 10.1016/j.comnet.2025.111523
URL: https://www.sciencedirect.com/science/article/pii/S1389128625004906
Impact factor: 4.4; JCR: Q1 (Computer Science, Hardware & Architecture)
Citations: 0
Abstract
Erasure coding is widely adopted in large-scale distributed storage systems because it significantly reduces storage overhead while ensuring high reliability. However, repairing failed data in erasure-coded systems requires retrieving data from multiple nodes, which generates substantial network traffic and often leads to incast congestion and degraded repair performance. Existing solutions alleviate requester-side congestion by offloading aggregation operations to helpers (nodes that provide repair data), but they inevitably increase inter-helper traffic and still struggle to fully utilize global network resources. To this end, we propose InaPR (In-network Aggregation Enabled Parallel Repair for Multiple Sub-Blocks), a framework that leverages programmable switches to perform in-network aggregation during data repair. InaPR decomposes a data repair task into multiple tree-structured pipelines, allowing a repair to collect source data from more helpers than the fixed k nodes it otherwise requires. The bandwidth allocation for each pipeline is then optimized in two stages: (1) a heuristic helper allocation strategy that assigns high-bandwidth helpers across multiple pipelines while distributing low-capacity ones among distinct pipelines; (2) a throughput-maximizing bandwidth allocation formulated as a linear programming model. We also extend the architecture to cross-rack scenarios through virtual node decomposition. Finally, we prototype InaPR using a P4-programmable switch and validate its performance in real-world evaluations and multi-rack simulations. Experimental results demonstrate that InaPR achieves 6.74% higher repair throughput than state-of-the-art methods in single-rack prototype tests and an 11.03% improvement in cross-rack simulations.
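To make the idea of sub-block repair through aggregating pipelines concrete, the following is a minimal sketch and not the authors' implementation. It assumes the simplest possible erasure code (three data blocks plus one XOR parity block, so every decoding coefficient is 1) and models the switch as a plain function that XORs whatever sub-blocks reach it. The names `aggregate`, `split`, and `repair_via_pipelines` are hypothetical. Because XOR is associative and commutative, an intermediate node can combine partial results without changing the repaired data.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def aggregate(chunks: list[bytes]) -> bytes:
    """Combine chunks by XOR, as an aggregating node (assumed) would."""
    return reduce(xor_bytes, chunks)


def split(block: bytes, parts: int) -> list[bytes]:
    """Cut a block into equal sub-blocks (length assumed divisible by parts)."""
    size = len(block) // parts
    return [block[i * size:(i + 1) * size] for i in range(parts)]


def repair_via_pipelines(helper_blocks: list[bytes], num_pipelines: int) -> bytes:
    """Repair the lost block one sub-block per pipeline, then concatenate."""
    per_helper = [split(b, num_pipelines) for b in helper_blocks]
    repaired = []
    for p in range(num_pipelines):
        sub_blocks = [parts[p] for parts in per_helper]
        # First hop: an aggregating node combines two helpers' sub-blocks.
        partial = aggregate(sub_blocks[:2])
        # Second hop: the partial result is combined with the remaining helpers.
        repaired.append(aggregate([partial] + sub_blocks[2:]))
    return b"".join(repaired)


# Three data blocks and one XOR parity block; block d1 is treated as lost.
d0 = bytes(range(0, 8))
d1 = bytes(range(8, 16))
d2 = bytes(range(16, 24))
parity = aggregate([d0, d1, d2])

recovered = repair_via_pipelines([d0, d2, parity], num_pipelines=2)
assert recovered == d1
```

In this sketch each pipeline carries only one sub-block from each helper, so the requester receives a single aggregated stream per pipeline instead of one stream per helper. A Reed-Solomon code would additionally scale each sub-block by a Galois-field coefficient before aggregation; that step is omitted here.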
About the journal:
Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.