Lin Wang, Yuchong Hu, Qian Du, D. Feng, R. Wu, Ingo He, Kevin Zhang
Title: Exploiting Parallelism of Disk Failure Recovery via Partial Stripe Repair for an Erasure-Coded High-Density Storage Server
Published in: Proceedings of the 51st International Conference on Parallel Processing
Publication date: 2022-08-29
DOI: 10.1145/3545008.3545014 (https://doi.org/10.1145/3545008.3545014)
Citation count: 0
Abstract
High-density storage servers (HDSSes), which pack many disks into a single server, are currently used in data centers to save costs (power, cooling, etc.). Erasure coding, which stripes data and provides high availability guarantees, is also commonly deployed in data centers at lower cost than replication. However, when applying erasure coding to a single HDSS, we find that state-of-the-art erasure coding studies that improve repair performance through parallelism mainly rely on the ample footprint of multiple servers, which is quite limited in a single HDSS, thus leading to a memory-competition issue for disk failure recovery. In this paper, for a single HDSS, we analyze the parallelism of its disk failure recovery, which exists within each stripe (intra-stripe) and between stripes (inter-stripe), observe that the intra-stripe and inter-stripe parallelisms are mutually restrictive, and explore how they affect the disk failure recovery time. Based on these observations, we propose partial stripe repair schemes for the HDSS (HD-PSR), which exploit parallelism in both active and passive ways for single-disk recovery. We further propose a cooperative repair strategy to improve multi-disk recovery performance. We prototype HD-PSR and show via Amazon EC2 experiments that the recovery time of a single-disk failure and a multi-disk failure can be reduced by up to 71.7% and 52.5%, respectively, compared with an existing erasure-coded repair scheme in high-density storage.
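The tension between intra-stripe and inter-stripe parallelism under a fixed memory budget can be illustrated with a toy model. The sketch below is not the HD-PSR implementation: it assumes a single XOR parity per stripe (real erasure codes such as Reed-Solomon use Galois-field coefficients), and the function names `repair_full`, `repair_partial`, and `concurrent_stripes`, as well as the "read buffers plus one accumulator" memory accounting, are hypothetical choices made only for illustration. It shows that a lost chunk can be rebuilt by folding surviving chunks in small batches, which gives the same result as reading the whole stripe at once while holding fewer chunks in memory per stripe.

```python
from functools import reduce
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length chunks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


def repair_full(survivors: List[bytes]) -> bytes:
    """Full-stripe repair: combine all surviving chunks in one pass."""
    return reduce(xor_bytes, survivors)


def repair_partial(survivors: List[bytes], batch: int) -> bytes:
    """Partial-stripe repair: read `batch` chunks per round and fold
    them into a single accumulator before reading the next round."""
    acc = bytes(len(survivors[0]))  # all-zero chunk, the XOR identity
    for start in range(0, len(survivors), batch):
        for chunk in survivors[start:start + batch]:
            acc = xor_bytes(acc, chunk)
    return acc


def concurrent_stripes(memory_chunks: int, chunks_per_round: int) -> int:
    """Toy accounting (an assumption): each in-flight stripe holds
    `chunks_per_round` read buffers plus one accumulator buffer."""
    return memory_chunks // (chunks_per_round + 1)


# Hypothetical stripe: four data chunks and one XOR parity chunk.
data = [bytes([v] * 8) for v in (3, 5, 7, 11)]
parity = reduce(xor_bytes, data)

# Suppose the disk holding data[1] fails; four chunks survive.
survivors = [data[0], data[2], data[3], parity]

assert repair_full(survivors) == data[1]
assert repair_partial(survivors, batch=2) == data[1]

# With room for 12 chunks, smaller rounds let more stripes repair at once.
assert concurrent_stripes(12, chunks_per_round=4) == 2
assert concurrent_stripes(12, chunks_per_round=2) == 4
```

In this model, a smaller batch lowers the memory held per stripe, so more stripes can be repaired concurrently (higher inter-stripe parallelism), but each stripe then needs more sequential read rounds (lower intra-stripe parallelism). This is one simple way to picture why the two forms of parallelism restrict each other on a memory-limited server.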