Hard Drive Failure Prediction Using Big Data
Authors: Wenjun Yang, Dianming Hu, Yuliang Liu, Shuhao Wang, Tianming Jiang
Published in: 2015 IEEE 34th Symposium on Reliable Distributed Systems Workshop (SRDSW), 2015-09-28
DOI: 10.1109/SRDSW.2015.15 (https://doi.org/10.1109/SRDSW.2015.15)
Citations: 34
Abstract
We design a general framework named Hdoctor for hard drive failure prediction. Hdoctor leverages the power of big data to achieve a significant improvement over previous work that used sophisticated machine learning algorithms. Hdoctor introduces a series of engineering innovations: (1) constructing time-dependent features to characterize the transitions in Self-Monitoring, Analysis and Reporting Technology (SMART) values during disk failures, (2) combining features so that the model can learn the correlations among different SMART attributes, and (3) treating environmental data such as cluster workload, temperature, humidity, and location as additional features. Hdoctor also collects and labels samples and updates the model automatically, and it works well for all kinds of disk failure prediction in our intelligent data center. In this work, we use Hdoctor to collect 74,477,717 training records from our clusters, covering 220,022 disks. By training a simple and scalable model, our system achieves a detection rate of 97.82% with a false alarm rate (FAR) of 0.3%, which substantially outperforms all previous algorithms. In addition, Hdoctor serves as a useful guide to predicting different hardware failures efficiently under various circumstances.
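The abstract does not say how the time-dependent and combined features are built, or how the detection rate and FAR are computed. The following Python sketch is one plausible reading, not Hdoctor's implementation. It takes the change in a SMART value over a few lags as the time-dependent feature, uses pairwise products as the combined features, and assumes the usual definitions: detection rate is the fraction of failed drives flagged, and FAR is the fraction of healthy drives flagged. All function names, lags, and sample values are hypothetical.

```python
from itertools import combinations


def time_dependent_features(series, lags=(1, 3)):
    """Change in one SMART value between the latest reading and the reading
    `lag` samples earlier. Lags longer than the history yield 0.0.
    (Assumed feature design; the abstract does not specify it.)"""
    latest = series[-1]
    return [
        float(latest - series[-1 - lag]) if lag < len(series) else 0.0
        for lag in lags
    ]


def combined_features(values):
    """Pairwise products of features, a simple way to let a linear model
    see interactions between attributes (assumed combination scheme)."""
    return [a * b for a, b in combinations(values, 2)]


def detection_and_false_alarm_rate(y_true, y_pred):
    """Return (detection rate, FAR) with 1 = failed drive, 0 = healthy drive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_alarm_rate


if __name__ == "__main__":
    # Hypothetical daily readings of a reallocated-sector counter.
    reallocated = [0, 0, 2, 5, 9]
    deltas = time_dependent_features(reallocated)
    print(deltas)
    print(combined_features(deltas + [2.0]))
    print(detection_and_false_alarm_rate([1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 0]))
```

Under these definitions, a FAR of 0.3% means that roughly 3 in every 1,000 healthy drives would be flagged as failing.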