Mitigate HDD Fail-Slow by Pro-actively Utilizing System-level Data Redundancy with Enhanced HDD Controllability and Observability

2019 35th Symposium on Mass Storage Systems and Technologies (MSST) Pub Date : 2019-05-20 DOI:10.1109/MSST.2019.000-2

Jingpeng Hao, Yin Li, Xubin Chen, Tong Zhang


Citations: 1

Abstract

This paper presents a design framework aiming to mitigate occasional HDD fail-slow. Due to their mechanical nature, HDDs may occasionally suffer from spikes of abnormally high internal read retry rates, which temporarily and significantly degrade speed (especially read latency). Intuitively, one could expect that existing system-level data redundancy (e.g., RAID or distributed erasure coding) may be opportunistically utilized to mitigate HDD fail-slow. Nevertheless, current practice tends to use system-level redundancy merely as a safety net, i.e., it reconstructs data sectors via system-level redundancy only after the costly intra-HDD read retry fails. This paper shows that one could mitigate occasional HDD fail-slow much more effectively by utilizing existing system-level data redundancy more pro-actively, as a complement to (or even a replacement for) intra-HDD read retry. To enable this, HDDs should support a higher degree of controllability and observability in terms of their internal read retry operations. Assuming a very simple form of enhanced HDD controllability and observability, this paper presents design solutions and a mathematical formulation framework to facilitate the practical implementation of such a pro-active strategy for mitigating occasional HDD fail-slow. Using RAID as a test vehicle, our experimental results show that the proposed design solutions can effectively mitigate the RAID read latency degradation even when HDDs suffer from read retry rates as high as 1% or 2%.
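The contrast between the safety-net and pro-active strategies can be illustrated with a toy model. The sketch below is not the paper's design or its mathematical formulation: `ToyDrive`, the `allow_retry` flag (standing in for the assumed controllability and observability), and the latency constants `BASE_MS` and `RETRY_MS` are all hypothetical, and a single-parity XOR stripe is assumed as the RAID layout.

```python
from functools import reduce

BASE_MS = 10    # assumed latency of a normal read (illustrative value)
RETRY_MS = 100  # assumed extra latency of internal read retries (illustrative value)

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (single-parity reconstruction)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

class ToyDrive:
    """Hypothetical drive: a read either succeeds at once or needs internal retries."""
    def __init__(self, block, needs_retry=False):
        self.block = block
        self.needs_retry = needs_retry

    def read(self, allow_retry):
        # Returns (data, latency_ms). With allow_retry=False the drive skips
        # the retry and reports the condition by returning no data.
        if not self.needs_retry:
            return self.block, BASE_MS
        if allow_retry:
            return self.block, BASE_MS + RETRY_MS
        return None, BASE_MS

def safety_net_read(drives, target):
    """Current practice in this toy model: wait for the intra-HDD retry."""
    return drives[target].read(allow_retry=True)

def proactive_read(drives, target):
    """Skip the retry and rebuild the block from the other drives in the stripe."""
    data, latency = drives[target].read(allow_retry=False)
    if data is not None:
        return data, latency
    # Peer reads are assumed to run in parallel, so the slowest one dominates.
    peers = [d.read(allow_retry=True) for i, d in enumerate(drives) if i != target]
    rebuilt = xor_blocks([blk for blk, _ in peers])
    return rebuilt, latency + max(ms for _, ms in peers)

# Usage: three data blocks plus parity, with drive 1 in a fail-slow episode.
data_blocks = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
drives = [ToyDrive(b) for b in data_blocks] + [ToyDrive(xor_blocks(data_blocks))]
drives[1].needs_retry = True
print(safety_net_read(drives, 1))  # same data, retry latency included
print(proactive_read(drives, 1))   # same data, rebuilt from peers
```

With these assumed constants, both calls return the same block, but the safety-net path takes 110 ms while the pro-active path takes 20 ms. Real gains depend on actual retry durations, the load on the peer drives, and how the retry condition is exposed by the drive, none of which this sketch models.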

