Protecting against rare event failures in archival systems

2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems. Pub Date: 2009-12-28. DOI: 10.1109/MASCOT.2009.5366825

Avani Wildani, T. Schwarz, E. L. Miller, D. Long


Citations: 39

Abstract

Digital archives are growing rapidly, necessitating stronger reliability measures than RAID to avoid data loss from device failure. Mirroring, a popular solution, is too expensive over time. We present a compromise solution that uses multi-level redundancy coding to reduce the probability of data loss from multiple simultaneous device failures. This approach handles small-scale failures of one or two devices efficiently while still allowing the system to survive rare-event, larger-scale failures of four or more devices. In our approach, each disk is split into a set of fixed-size disklets, which are used to construct reliability stripes. To protect against rare-event failures, reliability stripes are grouped into larger super-groups, each of which has a corresponding super-parity; super-parity is used to recover data only when disk failures overwhelm the redundancy in a single reliability stripe. Super-parity can be stored on a variety of devices, such as NV-RAM and always-on disks, to offset write bottlenecks while still keeping the number of active devices low. Our calculations of failure probabilities show that adding super-parity allows our system to absorb many more disk failures without data loss. Through discrete-event simulation, we found that adding super-groups has a significant impact on mean time to data loss and that rebuilds are slow but not unmanageable. Finally, we showed that robustness against rare events can be achieved for a fraction of total system cost.
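The two-level idea in the abstract can be illustrated with a small sketch. The abstract does not say which redundancy code the authors use, so everything below is an assumption made for illustration: each reliability stripe gets a single XOR parity, and each super-parity is the XOR of the disklets at the same position across all stripes in the super-group. The function names (`xor_blocks`, `build_super_group`, `recover_stripe`) are hypothetical and do not come from the paper.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def build_super_group(stripes):
    """Return (stripe_parities, super_parities) for a list of stripes.

    Each stripe is a list of equal-length data disklets (bytes).
    """
    stripe_parities = [xor_blocks(stripe) for stripe in stripes]
    # Assumed layout: super-parity j covers disklet j of every stripe.
    super_parities = [xor_blocks(column) for column in zip(*stripes)]
    return stripe_parities, super_parities


def recover_stripe(stripes, index, stripe_parities, super_parities):
    """Rebuild the missing disklets (marked None) in stripes[index].

    Assumes the other stripes in the super-group are intact at the
    positions that need super-parity.
    """
    stripe = list(stripes[index])
    missing = [j for j, d in enumerate(stripe) if d is None]
    # Stripe parity can repair only one loss, so every missing disklet
    # except the last is rebuilt from super-parity.
    for j in missing[:-1]:
        others = [s[j] for i, s in enumerate(stripes) if i != index]
        stripe[j] = xor_blocks(others + [super_parities[j]])
    if missing:
        last = missing[-1]
        survivors = [d for j, d in enumerate(stripe) if j != last]
        stripe[last] = xor_blocks(survivors + [stripe_parities[index]])
    return stripe


# Three stripes of three disklets each; stripe 1 loses two disklets,
# which a single stripe parity could not repair on its own.
stripes = [[b"ab", b"cd", b"ef"], [b"gh", b"ij", b"kl"], [b"mn", b"op", b"qr"]]
stripe_parities, super_parities = build_super_group(stripes)
damaged = [list(s) for s in stripes]
damaged[1][0] = None
damaged[1][2] = None
restored = recover_stripe(damaged, 1, stripe_parities, super_parities)
```

In this toy layout, a single loss inside a stripe is repaired from the stripe parity alone, and the super-parity is read only when a stripe has lost more than its own parity can cover, which mirrors the division of labor the abstract describes.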


