Reliability Analysis of Highly Redundant Distributed Storage Systems with Dynamic Refuging

Hiroaki Akutsu, K. Ueda, Takeru Chiba, Tomohiro Kawaguchi, Norio Shimozono
Published in: 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing
Publication date: 2015-03-04
DOI: 10.1109/PDP.2015.32 (https://doi.org/10.1109/PDP.2015.32)
Citations: 5

Abstract

In recent data centres, large-scale storage systems that store big data comprise thousands of large-capacity drives. Our goal is to establish a method for building highly reliable storage systems from more than a thousand low-cost, large-capacity drives. Some large-scale storage systems use erasure coding to protect against data loss. Raising the redundancy level of the erasure code lowers the probability of data loss, but it also increases the cost of normal write operations and the additional storage needed for coding. We therefore need to achieve high reliability at the lowest possible redundancy level. There are two reliability concerns in large-scale storage systems: (i) as the number of drives increases, systems become more subject to multiple drive failures, and (ii) distributing stripes among many drives can shorten the rebuild time but increases the risk of data loss due to multiple drive failures. Prior quantitative reliability studies based on realistic settings did not address these concerns. In this work, we analyze the reliability of large-scale storage systems with distributed stripes, focusing on an effective rebuild method which we call Dynamic Refuging. Dynamic Refuging rebuilds failed storage areas starting from those with the lowest redundancy and strategically selects which blocks to read to repair lost data. We modeled how the amount of storage at each redundancy level changes dynamically under multiple drive failures, and performed reliability analysis by Monte Carlo simulation using realistic drive failure characteristics. When stripes with redundancy level 3 were sufficiently distributed and rebuilt by Dynamic Refuging, the probability of data loss decreased by two orders of magnitude for systems with 384 or more drives compared with normal RAID. The technique scaled well: a system with 1536 inexpensive drives attained a lower data loss probability than RAID 6 with 16 enterprise-class drives.
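
To make the idea of a Monte Carlo reliability estimate concrete, here is a minimal sketch. It is not the authors' model. It assumes a single stripe group, exponentially distributed drive lifetimes, and a fixed rebuild time per failed drive, and it counts data loss whenever more drives are down at once than the redundancy level tolerates. It does not model distributed stripes or the rebuild ordering of Dynamic Refuging. All function names, parameter names, and the values in the usage lines are hypothetical.

```python
import random


def loses_data(n_drives, redundancy, mttf_h, rebuild_h, mission_h, rng):
    """Simulate one mission; return True if the group loses data.

    Assumptions (illustrative only): exponential drive lifetimes with mean
    mttf_h hours, a fixed rebuild_h hours per failed drive, and data loss
    as soon as more than `redundancy` drives are down at the same time.
    """
    if redundancy >= n_drives:
        raise ValueError("redundancy must be smaller than n_drives")
    t = 0.0
    rebuild_done = []  # completion times of rebuilds in progress, ascending
    while True:
        healthy = n_drives - len(rebuild_done)
        # Exponential lifetimes are memoryless, so the next failure time can
        # be resampled after every event using the current healthy count.
        next_fail = t + rng.expovariate(healthy / mttf_h)
        next_repair = rebuild_done[0] if rebuild_done else float("inf")
        if min(next_fail, next_repair) > mission_h:
            return False  # mission ended without data loss
        if next_repair <= next_fail:
            t = rebuild_done.pop(0)  # a rebuild finished first
            continue
        t = next_fail
        # Fixed rebuild time and increasing t keep the list sorted.
        rebuild_done.append(t + rebuild_h)
        if len(rebuild_done) > redundancy:
            return True  # too many concurrent failures


def estimate_loss_probability(trials, seed=0, **params):
    """Fraction of simulated missions that end in data loss."""
    rng = random.Random(seed)
    losses = sum(loses_data(rng=rng, **params) for _ in range(trials))
    return losses / trials


if __name__ == "__main__":
    # Hypothetical parameters chosen only to exercise the code.
    p = estimate_loss_probability(
        trials=10_000, n_drives=16, redundancy=2,
        mttf_h=100_000.0, rebuild_h=24.0, mission_h=5 * 8760.0,
    )
    print(f"estimated data loss probability: {p:.4f}")
```

Rare data-loss events need many trials before the estimate is stable, so a plain estimator like this one becomes expensive at very low loss probabilities.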