Exploiting Parallelism of Disk Failure Recovery via Partial Stripe Repair for an Erasure-Coded High-Density Storage Server

Proceedings of the 51st International Conference on Parallel Processing | Pub Date: 2022-08-29 | DOI: 10.1145/3545008.3545014

Lin Wang, Yuchong Hu, Qian Du, D. Feng, R. Wu, Ingo He, Kevin Zhang



Abstract

High-density storage servers (HDSSes), which pack many disks into a single server, are used in data centers to reduce costs (power, cooling, etc.). Erasure coding, which stripes data and provides high availability guarantees, is also commonly deployed in data centers at a lower cost than replication. However, when applying erasure coding to a single HDSS, we find that state-of-the-art approaches that improve repair performance through parallelism mainly rely on the ample memory footprint of multiple servers; in a single HDSS this footprint is quite limited, which leads to a memory-competition issue during disk failure recovery. In this paper, for a single HDSS, we analyze the parallelism of disk failure recovery, which exists within each stripe (intra-stripe) and between stripes (inter-stripe), observe that the intra-stripe and inter-stripe parallelisms are mutually restrictive, and explore how they affect the disk failure recovery time. Based on these observations, we propose partial stripe repair schemes for the HDSS (HD-PSR), which exploit parallelism in both active and passive ways for single-disk recovery. We further propose a cooperative repair strategy to improve multi-disk recovery performance. We prototype HD-PSR and show via Amazon EC2 experiments that the recovery time of a single-disk failure and a multi-disk failure can be reduced by up to 71.7% and 52.5%, respectively, compared with the existing erasure-coded repair scheme in high-density storage.
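The tension between intra-stripe and inter-stripe parallelism can be illustrated with a toy model. The sketch below is not the authors' HD-PSR implementation. It assumes a single XOR parity per stripe (a real deployment would more likely use a Reed-Solomon code), and it assumes that each in-flight stripe holds exactly `p` chunk buffers at a time. The names `partial_stripe_repair`, `parallelism_tradeoff`, and `memory_slots` are hypothetical and chosen only for illustration.

```python
from functools import reduce
from math import ceil


def xor_chunks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def partial_stripe_repair(surviving, p):
    """Rebuild the lost chunk of an XOR-parity stripe, reading p chunks per round.

    Each round XORs a batch of at most p surviving chunks into a partial
    result, so only p chunk buffers are needed at once instead of all k.
    Returns (repaired_chunk, number_of_rounds).
    """
    acc = bytes(len(surviving[0]))  # all-zero accumulator
    rounds = 0
    for i in range(0, len(surviving), p):
        batch = surviving[i:i + p]
        acc = xor_chunks(acc, reduce(xor_chunks, batch))
        rounds += 1
    return acc, rounds


def parallelism_tradeoff(k, memory_slots):
    """List (p, stripes_in_flight, rounds_per_stripe) for a fixed buffer budget.

    Toy assumption: a stripe repaired with intra-stripe width p occupies
    p buffers, so at most memory_slots // p stripes fit in memory at once.
    """
    return [
        (p, memory_slots // p, ceil(k / p))
        for p in range(1, k + 1)
        if memory_slots // p >= 1
    ]


if __name__ == "__main__":
    data = [bytes([1, 2]), bytes([3, 4]), bytes([5, 6]), bytes([7, 8])]
    parity = reduce(xor_chunks, data)
    lost = data[2]
    surviving = [data[0], data[1], data[3], parity]

    repaired, rounds = partial_stripe_repair(surviving, p=2)
    print(repaired == lost, rounds)

    for row in parallelism_tradeoff(k=4, memory_slots=8):
        print(row)
```

Under these assumptions, widening the per-stripe read width `p` lowers the number of rounds each stripe needs but also lowers how many stripes fit in memory concurrently, which is one simple way to read the "mutually restrictive" observation in the abstract.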


