Authors: Zhenghua Xue, Xiaoshe Dong, Siyuan Ma, W. Dong
Published in: Eighth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD 2007)
Publication date: 2007-07-30
DOI: https://doi.org/10.1109/SNPD.2007.284
A Survey on Failure Prediction of Large-Scale Server Clusters
As the size and complexity of cluster systems grow, failure rates increase dramatically. To reduce the damage caused by failures, it is desirable to identify potential failures before they occur. In this paper, we survey the state of the art in failure prediction for cluster systems. The characteristics of failures in cluster systems are addressed, and some statistical results are shown. We explore ways of collecting and preprocessing data for failure prediction, and suggest a procedure for preprocessing the records in automatically generated log files. We then analyze the main ideas of five prediction methods: statistics-based thresholds, time series analysis, rule-based classification, Bayesian network models, and semi-Markov process models. In addition, with regard to accuracy and practicality, we present five metrics for evaluating failure prediction techniques and compare the five methods against these metrics.
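
To make the first of the five method families concrete, the following is a minimal sketch of a statistics-based threshold check. It is an illustration only and is not taken from the paper, whose abstract gives no implementation details. The function name `threshold_alerts`, the parameters `window` and `k`, and the sample readings are all assumptions: a sample is flagged when it exceeds the mean of the preceding window by more than `k` sample standard deviations.

```python
from statistics import mean, stdev


def threshold_alerts(samples, window=5, k=3.0):
    """Return indices of samples that exceed mean + k * stdev of the
    preceding `window` samples (hypothetical rule, for illustration)."""
    alerts = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        # The limit is recomputed from the trailing window only, so it
        # adapts as the monitored metric drifts.
        limit = mean(history) + k * stdev(history)
        if samples[i] > limit:
            alerts.append(i)
    return alerts


# Hypothetical readings, e.g. corrected memory errors per hour on one node.
readings = [2, 3, 2, 3, 2, 3, 2, 20, 3, 2]
alerts = threshold_alerts(readings)
print(alerts)
```

With these made-up readings, only the spike at index 7 is flagged. Note that a perfectly constant window has zero standard deviation, so any increase at all would trigger an alert; a practical detector would likely add a minimum margin.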