Achieving High Reliability via Expediting the Repair of Critical Blocks in Replicated Storage Systems

Juntao Fang, Shenggang Wan, Ping-Hsiu Huang, Xubin He, C. Xie
{"title":"Achieving High Reliability via Expediting the Repair of Critical Blocks in Replicated Storage Systems","authors":"Juntao Fang, Shenggang Wan, Ping-Hsiu Huang, Xubin He, C. Xie","doi":"10.1109/SRDS.2016.018","DOIUrl":null,"url":null,"abstract":"High reliability is critical to large data centers consisting of hundreds to thousands of storage nodes where node failures are not rare. Data replication is a typical technique deployed to achieve high reliability. When a node failure is detected, blocks with lost replicas are identified and recovered. Long timeouts are usually used for node failure detection. For blocks with one lost replica, the long timeouts can significantly reduce network traffic induced by data recovery. However, for blocks with two or more lost replicas, which can be caused by concurrent node failures that are not rare in large data centers, the long timeouts will result in a high risk of loss of these blocks. In this paper, we propose MFR to separate the identification of the blocks with two or more lost replicas from that of the blocks with one lost replica in a way that the identification of the blocks with two or more replicas can be accelerated while that of the blocks with one lost replica stays the same. Consequently, MFR can significantly improve data reliability while keeping the network traffic induced by data recovery stable. The results from our simulation and prototype implementation show that MFR improves the reliability of storage systems by a factor of up to 4.0 in terms of mean time to data loss. As blocks with two or more lost replicas are far fewer than blocks with one lost replica, the extra network traffic caused by MFR is less than 0.54% of total network traffic for data recovery.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SRDS.2016.018","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

High reliability is critical to large data centers consisting of hundreds to thousands of storage nodes where node failures are not rare. Data replication is a typical technique deployed to achieve high reliability. When a node failure is detected, blocks with lost replicas are identified and recovered. Long timeouts are usually used for node failure detection. For blocks with one lost replica, the long timeouts can significantly reduce network traffic induced by data recovery. However, for blocks with two or more lost replicas, which can be caused by concurrent node failures that are not rare in large data centers, the long timeouts will result in a high risk of loss of these blocks. In this paper, we propose MFR to separate the identification of the blocks with two or more lost replicas from that of the blocks with one lost replica in a way that the identification of the blocks with two or more replicas can be accelerated while that of the blocks with one lost replica stays the same. Consequently, MFR can significantly improve data reliability while keeping the network traffic induced by data recovery stable. The results from our simulation and prototype implementation show that MFR improves the reliability of storage systems by a factor of up to 4.0 in terms of mean time to data loss. As blocks with two or more lost replicas are far fewer than blocks with one lost replica, the extra network traffic caused by MFR is less than 0.54% of total network traffic for data recovery.
通过快速修复复制存储系统中的关键块实现高可靠性
对于由数百到数千个存储节点组成的大型数据中心来说,高可靠性至关重要,这些数据中心的节点故障并不罕见。数据复制是为实现高可靠性而部署的一种典型技术。当检测到节点故障时,将识别并恢复具有丢失副本的块。长超时通常用于节点故障检测。对于具有一个丢失副本的块,长超时可以显著减少由数据恢复引起的网络流量。但是,对于具有两个或多个丢失副本的块(这可能是由大型数据中心中并不罕见的并发节点故障引起的),长超时将导致这些块丢失的高风险。在本文中,我们建议MFR将具有两个或多个丢失副本的块的识别与具有一个丢失副本的块的识别分开,这样可以加速具有两个或多个副本的块的识别,而具有一个丢失副本的块的识别保持不变。因此,MFR可以显著提高数据可靠性,同时保持数据恢复引起的网络流量稳定。我们的仿真和原型实现的结果表明,MFR在数据丢失的平均时间方面将存储系统的可靠性提高了4.0倍。由于具有两个或多个丢失副本的块远远少于具有一个丢失副本的块,因此由MFR引起的额外网络流量小于用于数据恢复的总网络流量的0.54%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信