Large-scale End-of-Life Prediction of Hard Disks in Distributed Datacenters

Rohan Mohapatra, Austin Coursey, Saptarshi Sengupta
2023 IEEE International Conference on Smart Computing (SMARTCOMP)
DOI: 10.1109/SMARTCOMP58114.2023.00069 (https://doi.org/10.1109/SMARTCOMP58114.2023.00069)
Published: 2023-03-15

Abstract

Data centers process huge volumes of data every day, backed by the proliferation of inexpensive hard disks. Data stored on these disks serves a range of critical functional needs, from finance and healthcare to aerospace. As such, premature disk failure and the consequent loss of data can be catastrophic. To mitigate the risk of failures, cloud storage providers perform condition-based monitoring and replace hard disks before they fail. By estimating the remaining useful life of a hard disk drive, one can predict the time-to-failure of that device and replace it at the right time, ensuring maximum utilization whilst reducing operational costs. In this work, large-scale predictive analyses are performed on severely skewed health statistics data by incorporating customized feature engineering and a suite of sequence learners. Past work suggests that LSTMs are an excellent approach to predicting remaining useful life. To this end, we present an encoder-decoder LSTM model in which the context gained from understanding health statistics sequences aids in predicting an output sequence of the number of days remaining before a disk potentially fails. The models developed in this work are trained and tested on an exhaustive set covering all 10 years of S.M.A.R.T. health data in circulation from Backblaze and on a wide variety of disk instances. This work closes the knowledge gap on what full-scale training achieves on thousands of devices and advances the state of the art by providing tangible metrics for evaluation and generalization to practitioners looking to extend their workflow to all years of health data in circulation across disk manufacturers. The encoder-decoder LSTM posted an RMSE of 0.83 during training and 0.86 during testing over the exhaustive 10-year data, while generalizing competitively to other drives from the Seagate family.
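The abstract does not spell out how the remaining-useful-life targets or the RMSE are computed, so the following is only a minimal sketch of one plausible setup, not the authors' pipeline. It assumes each drive has one record per day with a 0/1 failure flag that is 1 only on the final day, and that the target sequence is the number of days remaining over the last few days of each input window. The function names `rul_labels`, `make_windows`, and `rmse`, as well as the window lengths, are hypothetical and chosen for illustration.

```python
import math


def rul_labels(failure_flags):
    """Days remaining until failure for each daily record of one drive.

    Assumption: failure_flags is chronological, one 0/1 entry per day,
    and the last entry is 1 (the day the drive failed).
    """
    if not failure_flags or failure_flags[-1] != 1:
        raise ValueError("series must end on the failure day")
    n = len(failure_flags)
    return [n - 1 - i for i in range(n)]


def make_windows(features, rul, in_len, out_len):
    """Pair sliding input windows with target RUL sequences.

    Each input is in_len consecutive days of health readings; each target
    is the RUL for the last out_len days of that window (an assumed
    convention, not one stated in the abstract).
    """
    if out_len > in_len:
        raise ValueError("out_len must not exceed in_len")
    pairs = []
    for start in range(len(features) - in_len + 1):
        end = start + in_len
        pairs.append((features[start:end], rul[end - out_len:end]))
    return pairs


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("sequences must be non-empty and equal in length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical five-day history of a single health attribute for one drive.
flags = [0, 0, 0, 0, 1]
readings = [[5.0], [6.0], [8.0], [11.0], [15.0]]

rul = rul_labels(flags)
windows = make_windows(readings, rul, in_len=3, out_len=2)
for inputs, targets in windows:
    print(inputs, "->", targets)
```

A sequence model such as the encoder-decoder LSTM described above would consume the input windows and emit the target sequences; `rmse` then scores its predictions against the targets. Whether the reported RMSE values are in days or in some normalized unit is not stated in the abstract, so the sketch makes no claim about that.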