Lei Liu, Yong Wang, Yangfan Liang, Junqi Chen, Qian He
Computer Networks, Volume 270, Article 111523. Published 2025-07-16. Journal Article.
DOI: 10.1016/j.comnet.2025.111523
Source: https://www.sciencedirect.com/science/article/pii/S1389128625004906
Journal impact factor: 4.4. JCR: Q1 (Computer Science, Hardware & Architecture). Region 2, category: Computer Science. Open access: no.
Citations: 0
In-network aggregation enabled multiple sub-blocks parallel repair in erasure-coded storage system
Abstract
Erasure coding has gained widespread adoption in large-scale distributed storage systems because it significantly reduces storage overhead while ensuring high reliability. However, repairing failed data in erasure-coded systems requires retrieving data from multiple nodes, which generates substantial network traffic and often leads to incast congestion and degraded repair performance. Existing solutions alleviate requester-side congestion by offloading aggregation operations to helpers (nodes that provide repair data), but they inevitably increase inter-helper traffic and still struggle to fully utilize global network resources. To this end, we propose InaPR (In-network Aggregation Enabled Parallel Repair for Multiple Sub-Blocks), a framework that leverages programmable switches to perform in-network aggregation during data repair. InaPR decomposes a data repair task into multiple tree-structured pipelines, so that a repair can collect source data from more helpers than the fixed k nodes otherwise required. The bandwidth allocation for each pipeline is then optimized through a two-stage methodology: (1) a heuristic helper allocation strategy that assigns high-bandwidth helpers across multiple pipelines while distributing low-capacity ones among distinct pipelines; (2) a throughput-maximizing bandwidth allocation formulated as a linear programming model. We also extend the architecture to cross-rack scenarios through virtual node decomposition. Finally, we prototype InaPR using a P4-programmable switch and validate its performance in real-world evaluations and multi-rack simulations. Experimental results demonstrate that InaPR achieves 6.74% higher repair throughput than state-of-the-art methods in single-rack prototype tests and an 11.03% improvement in cross-rack simulations.
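The core idea in the abstract, that partial repair data can be summed inside the network and that different sub-blocks can draw on different helper sets, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy Reed-Solomon-style code over GF(2^8) built from Lagrange interpolation, models the switch as a plain XOR of packet payloads, and uses hypothetical names throughout (`encode`, `repair_sub_block`, `switch_aggregate`, `demo`). The helper sets and sub-block boundaries are hard-coded; the paper's heuristic helper allocation and linear-programming bandwidth allocation are not modeled.

```python
from functools import reduce

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8+x^4+x^3+x^2+1 (0x11D)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return r

def gf_inv(a):
    """Inverse of a nonzero byte: a^254, since a^255 == 1 in GF(2^8)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def scale(c, chunk):
    """Multiply every byte of a chunk by the coefficient c."""
    return bytes(gf_mul(c, b) for b in chunk)

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def repair_coeffs(target, helpers):
    """Lagrange coefficients that rebuild node `target` from `helpers`.

    Assumption: node i stores the evaluation of a degree < k polynomial
    at point i, so any k nodes determine every other node.
    """
    coeffs = {}
    for h in helpers:
        c = 1
        for m in helpers:
            if m != h:
                c = gf_mul(c, gf_mul(target ^ m, gf_inv(h ^ m)))
        coeffs[h] = c
    return coeffs

def encode(data_blocks, n):
    """Systematic toy code: nodes 0..k-1 hold data, k..n-1 hold parity."""
    k = len(data_blocks)
    stripe = dict(enumerate(data_blocks))
    for node in range(k, n):
        coeffs = repair_coeffs(node, range(k))
        parts = (scale(coeffs[i], data_blocks[i]) for i in range(k))
        stripe[node] = reduce(xor_bytes, parts)
    return stripe

def switch_aggregate(packets):
    """Stand-in for the programmable switch: XOR the incoming payloads."""
    return reduce(xor_bytes, packets)

def repair_sub_block(stripe, failed, helpers, lo, hi):
    """One pipeline: helpers pre-scale bytes lo..hi, the switch sums them."""
    coeffs = repair_coeffs(failed, helpers)
    packets = [scale(coeffs[h], stripe[h][lo:hi]) for h in helpers]
    return switch_aggregate(packets)

def demo():
    data = [b"AAAABBBB", b"CCCCDDDD", b"EEEEFFFF"]  # k = 3 data blocks
    stripe = encode(data, n=6)
    failed = 1
    survivors = {i: c for i, c in stripe.items() if i != failed}
    # Two pipelines with different helper sets: five helpers in total, k = 3.
    first = repair_sub_block(survivors, failed, [0, 2, 3], 0, 4)
    second = repair_sub_block(survivors, failed, [2, 4, 5], 4, 8)
    return first + second
```

In this sketch the requester receives one aggregated packet per sub-block instead of k packets, which is how in-network aggregation relieves the incast the abstract describes. Calling `demo()` rebuilds the block of the failed node from five distinct helpers, even though each sub-block needs only k = 3 of them.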
About the journal:
Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.