Yunfeng Zhu, P. Lee, Yuchong Hu, Liping Xiang, Yinlong Xu
{"title":"xor编码存储系统中单盘故障恢复的加速:理论与实践","authors":"Yunfeng Zhu, P. Lee, Yuchong Hu, Liping Xiang, Yinlong Xu","doi":"10.1109/MSST.2012.6232371","DOIUrl":null,"url":null,"abstract":"Modern storage systems stripe redundant data across multiple disks to provide availability guarantees against disk failures. One form of data redundancy is based on XOR-based erasure codes, which use only XOR operations for encoding and decoding. In addition to providing failure tolerance, a storage system must also provide fast failure recovery to avoid data unavailability. We consider the problem of speeding up the recovery of a single-disk failure for arbitrary XOR-based erasure codes. We address this problem from both theoretical and practical perspectives. We propose a replace recovery algorithm, which uses a hill-climbing technique to search for a fast recovery solution, such that the solution search can be completed within a short time period. We further implement our replace recovery algorithm atop a parallelized architecture to justify its practicality. We experiment our replace recovery algorithm and its parallelized implementation on a networked storage system testbed, and demonstrate that our replace recovery algorithm uses less recovery time than the conventional approach.","PeriodicalId":348234,"journal":{"name":"012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"40","resultStr":"{\"title\":\"On the speedup of single-disk failure recovery in XOR-coded storage systems: Theory and practice\",\"authors\":\"Yunfeng Zhu, P. Lee, Yuchong Hu, Liping Xiang, Yinlong Xu\",\"doi\":\"10.1109/MSST.2012.6232371\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Modern storage systems stripe redundant data across multiple disks to provide availability guarantees against disk failures. One form of data redundancy is based on XOR-based erasure codes, which use only XOR operations for encoding and decoding. In addition to providing failure tolerance, a storage system must also provide fast failure recovery to avoid data unavailability. We consider the problem of speeding up the recovery of a single-disk failure for arbitrary XOR-based erasure codes. We address this problem from both theoretical and practical perspectives. We propose a replace recovery algorithm, which uses a hill-climbing technique to search for a fast recovery solution, such that the solution search can be completed within a short time period. We further implement our replace recovery algorithm atop a parallelized architecture to justify its practicality. We experiment our replace recovery algorithm and its parallelized implementation on a networked storage system testbed, and demonstrate that our replace recovery algorithm uses less recovery time than the conventional approach.\",\"PeriodicalId\":348234,\"journal\":{\"name\":\"012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)\",\"volume\":\"47 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-04-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"40\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MSST.2012.6232371\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSST.2012.6232371","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
On the speedup of single-disk failure recovery in XOR-coded storage systems: Theory and practice
Modern storage systems stripe redundant data across multiple disks to provide availability guarantees against disk failures. One form of data redundancy is based on XOR-based erasure codes, which use only XOR operations for encoding and decoding. In addition to providing failure tolerance, a storage system must also provide fast failure recovery to avoid data unavailability. We consider the problem of speeding up the recovery of a single-disk failure for arbitrary XOR-based erasure codes. We address this problem from both theoretical and practical perspectives. We propose a replace recovery algorithm, which uses a hill-climbing technique to search for a fast recovery solution, such that the solution search can be completed within a short time period. We further implement our replace recovery algorithm atop a parallelized architecture to justify its practicality. We experiment our replace recovery algorithm and its parallelized implementation on a networked storage system testbed, and demonstrate that our replace recovery algorithm uses less recovery time than the conventional approach.