Adjustable Flat Layouts for Two-Failure Tolerant Storage Systems

2019 35th Symposium on Mass Storage Systems and Technologies (MSST), Pub Date: 2019-05-20, DOI: 10.1109/MSST.2019.000-1

T. Schwarz

Abstract

Systems suffer component failures at sometimes unpredictable rates. Storage systems are no exception: they add redundancy to deal with various types of failures. The additional storage is a significant capital and operational cost and needs to be sized appropriately. Unfortunately, storage device failure rates are difficult to predict and vary over the lifetime of the system. Large disk-based storage centers provide protection against failure at the level of objects. However, this abstraction makes it difficult to adjust to a batch of devices that fail at a higher rate than anticipated. We propose a solution that uses large pods of storage devices of the same kind, but that can reorganize in response to an increased number of component failures seen elsewhere in the system, or to an anticipated higher failure rate such as infant mortality or end-of-life fragility. We present ways of organizing user data and parity data that allow a system to move from three-failure tolerance to two-failure tolerance and back. A storage system using disk drives that might suffer from infant mortality can start with a three-failure-tolerant layout and switch to a two-failure-tolerant one once the disks have been burned in, gaining capacity by shedding failure tolerance that has become unnecessary. Conversely, a storage system using Flash can sacrifice capacity for reliability once its components have undergone many write-erase cycles and have thereby become less reliable. Adjustable reliability is easy to achieve with a standard layout based on RAID Level 6 stripes, in which components holding user data can readily be converted into ones holding parity data. Unlike the RAID layout, the layouts presented here use only exclusive-or operations and do not depend on sophisticated but power-hungry processors. Their main advantage is a noticeable increase in reliability over RAID Level 6.
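Standard Reed-Solomon-based RAID Level 6 computes its second parity with Galois-field multiplication, whereas the layouts described above use only exclusive-or. The abstract does not specify the paper's own layouts, so the following is only a rough illustration of the XOR-only idea under an assumed, generic design: a row-and-column parity grid in which every data block belongs to two XOR parity groups. Any two device failures can then be repaired by repeatedly fixing a group that has exactly one missing member. The names `build_grid`, `recover`, and the grid dimensions are hypothetical and are not taken from the paper.

```python
import os
from functools import reduce
from itertools import combinations


def xor_blocks(blocks):
    """XOR a non-empty list of equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def build_grid(data):
    """Hypothetical flat layout: data[i][j] is a block; add row and column parity.

    Device keys: ("d", i, j) data, ("r", i) row parity, ("c", j) column parity.
    """
    rows, cols = len(data), len(data[0])
    devices = {("d", i, j): data[i][j] for i in range(rows) for j in range(cols)}
    for i in range(rows):
        devices[("r", i)] = xor_blocks(data[i])
    for j in range(cols):
        devices[("c", j)] = xor_blocks([data[i][j] for i in range(rows)])
    return devices


def parity_groups(rows, cols):
    """Each group's members XOR to zero: a row (or column) plus its parity."""
    groups = [[("d", i, j) for j in range(cols)] + [("r", i)] for i in range(rows)]
    groups += [[("d", i, j) for i in range(rows)] + [("c", j)] for j in range(cols)]
    return groups


def recover(devices, failed, rows, cols):
    """Peeling decoder using only XOR; returns all device contents or None if stuck."""
    state = {k: v for k, v in devices.items() if k not in failed}
    missing = set(failed)
    progress = True
    while missing and progress:
        progress = False
        for group in parity_groups(rows, cols):
            lost = [k for k in group if k in missing]
            if len(lost) == 1:
                # The single lost member equals the XOR of the surviving members.
                state[lost[0]] = xor_blocks([state[k] for k in group if k not in missing])
                missing.discard(lost[0])
                progress = True
    return state if not missing else None


if __name__ == "__main__":
    rows, cols = 3, 4
    data = [[os.urandom(8) for _ in range(cols)] for _ in range(rows)]
    devices = build_grid(data)
    # Try every pattern of two simultaneous device failures.
    ok = all(recover(devices, set(p), rows, cols) == devices
             for p in combinations(devices, 2))
    print("all 2-failure patterns recoverable:", ok)
    # Losing a data block together with both of its parity blocks is fatal.
    print(recover(devices, {("d", 0, 0), ("r", 0), ("c", 0)}, rows, cols))
```

Running the script should report that every two-failure pattern is recoverable and print `None` for the three-failure pattern in which a data block loses both of its parity groups. That pattern is why this simple grid tolerates only two failures; moving to three-failure tolerance would require each data block to belong to more parity groups, at the cost of capacity.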
