Mitigate HDD Fail-Slow by Pro-actively Utilizing System-level Data Redundancy with Enhanced HDD Controllability and Observability
Jingpeng Hao, Yin Li, Xubin Chen, Tong Zhang
2019 35th Symposium on Mass Storage Systems and Technologies (MSST), 2019-05-20
DOI: 10.1109/MSST.2019.000-2 (https://doi.org/10.1109/MSST.2019.000-2)
Citations: 1
Abstract
This paper presents a design framework for mitigating occasional HDD fail-slow. Because of their mechanical nature, HDDs may occasionally suffer spikes of abnormally high internal read retry rates, which temporarily and significantly degrade speed (especially read latency). Intuitively, one could expect that existing system-level data redundancy (e.g., RAID or distributed erasure coding) may be opportunistically used to mitigate HDD fail-slow. Nevertheless, current practice tends to use system-level redundancy merely as a safety net, i.e., it reconstructs data sectors via system-level redundancy only after the costly intra-HDD read retry fails. This paper shows that occasional HDD fail-slow can be mitigated much more effectively by proactively using existing system-level data redundancy to complement (or even replace) intra-HDD read retry. To enable this, HDDs should support a higher degree of controllability and observability over their internal read retry operations. Assuming a very simple form of enhanced HDD controllability and observability, this paper presents design solutions and a mathematical formulation framework to facilitate the practical implementation of such a proactive strategy for mitigating occasional HDD fail-slow. Using RAID as a test vehicle, our experimental results show that the proposed design solutions can effectively mitigate RAID read latency degradation even when HDDs suffer from read retry rates as high as 1% or 2%.
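The contrast between the safety-net practice and the proactive strategy can be illustrated with a toy model. The sketch below is not the paper's design: the `SimulatedDrive` class, its `read(index, max_retries)` interface, the latency constants, and the single-parity XOR layout are all hypothetical assumptions chosen only to show the idea of capping intra-HDD retries and reconstructing from redundancy instead.

```python
from functools import reduce


class SimulatedDrive:
    """Toy drive (hypothetical). `retries_needed` maps a block index to the
    number of internal retries the drive would need to read that block."""

    BASE_COST_MS = 4.0   # assumed latency of a clean read
    RETRY_COST_MS = 8.0  # assumed latency added by each internal retry

    def __init__(self, blocks, retries_needed=None):
        self.blocks = blocks
        self.retries_needed = retries_needed or {}

    def read(self, index, max_retries):
        """Return (data or None, latency_ms, retries_used).

        `max_retries` stands in for controllability (the host caps retries);
        `retries_used` stands in for observability (the drive reports them).
        """
        needed = self.retries_needed.get(index, 0)
        used = min(needed, max_retries)
        latency = self.BASE_COST_MS + used * self.RETRY_COST_MS
        data = self.blocks[index] if needed <= max_retries else None
        return data, latency, used


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (single-parity reconstruction)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def proactive_read(drives, target, index, retry_budget=0, full_retries=16):
    """Read with a small retry budget; on failure, rebuild from the peers.

    Peer reads are modeled as parallel, so they cost the slowest peer's
    latency. Falls back to a full intra-drive retry only if a peer also fails.
    """
    data, latency, _ = drives[target].read(index, retry_budget)
    if data is not None:
        return data, latency
    peers = [d.read(index, retry_budget)
             for i, d in enumerate(drives) if i != target]
    latency += max(p[1] for p in peers)
    if all(p[0] is not None for p in peers):
        return xor_blocks([p[0] for p in peers]), latency
    data, retry_latency, _ = drives[target].read(index, full_retries)
    return data, latency + retry_latency


# Two data drives plus one parity drive; drive 0 needs 5 retries for block 0.
d0, d1 = b"\x01\x02", b"\x10\x20"
drives = [
    SimulatedDrive([d0], retries_needed={0: 5}),
    SimulatedDrive([d1]),
    SimulatedDrive([xor_blocks([d0, d1])]),
]
safety_net = drives[0].read(0, max_retries=16)  # retry inside the drive first
proactive = proactive_read(drives, target=0, index=0)
print(safety_net[1], proactive[1])
```

Under these assumed costs, the capped read plus reconstruction finishes faster than waiting for five internal retries. Real trade-offs depend on the load that reconstruction adds to peer drives, which this toy model ignores.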