A Runtime Fault Detection Method for HPC Cluster
Authors: Wu Linping, Luo Hongbing, Z. Jianfeng, Meng Dan
Published in: 2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies (2011-10-20)
DOI: 10.1109/PDCAT.2011.9 (https://doi.org/10.1109/PDCAT.2011.9)
Citation count: 0
Abstract
As the number of nodes keeps increasing, faults have become commonplace in HPC clusters. Fast recovery from faults requires a fault detection method. Based on the usage patterns of HPC clusters, this paper proposes an automatic runtime fault detection mechanism. First, the normal activities of the nodes in an HPC cluster are modeled from their runtime state by clustering analysis. Second, fault detection is performed by comparing the current runtime state of each node with the normal activity models. A fault alarm is raised immediately when the current runtime state deviates from the normal activity models. In the experiments, faults are simulated by fault injection, and the results show that the proposed runtime fault detection method detects faults with high accuracy.
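The two steps described in the abstract can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, the runtime-state features, or the deviation criterion, so everything below is an assumption made for illustration: plain k-means stands in for the clustering analysis, each runtime state is a hypothetical pair (CPU utilisation, memory utilisation), and a state counts as deviating when its distance to every cluster center exceeds a margin times the largest distance seen in the fault-free training data. The names `fit_normal_model`, `is_faulty`, `radius`, and `margin` are invented here and do not come from the paper.

```python
import numpy as np


def fit_normal_model(states, k=2, iters=20):
    """Cluster fault-free runtime states with plain k-means (an assumed choice)."""
    states = np.asarray(states, dtype=float)
    # Deterministic farthest-point initialisation spreads the initial centers out.
    centers = [states[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(states - c, axis=1) for c in centers], axis=0)
        centers.append(states[nearest.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # Distance from every sample to every center, shape (n_samples, k).
        dists = np.linalg.norm(states[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = states[labels == j]
            if len(members):  # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)
    # Radius of normal behaviour: worst-case distance from a training sample
    # to its nearest center (an assumed deviation criterion).
    dists = np.linalg.norm(states[:, None, :] - centers[None, :, :], axis=2)
    radius = dists.min(axis=1).max()
    return centers, radius


def is_faulty(state, centers, radius, margin=1.5):
    """Alarm when the state is farther than margin * radius from every center."""
    distance = np.linalg.norm(centers - np.asarray(state, dtype=float), axis=1).min()
    return bool(distance > margin * radius)


# Hypothetical fault-free samples: (cpu_utilisation, memory_utilisation)
# for idle nodes and for nodes running a job.
normal = [[0.05, 0.20], [0.06, 0.22], [0.04, 0.19], [0.05, 0.21],
          [0.90, 0.70], [0.92, 0.72], [0.88, 0.69], [0.91, 0.71]]
centers, radius = fit_normal_model(normal, k=2)
print(is_faulty([0.05, 0.21], centers, radius))  # state close to the idle cluster
print(is_faulty([0.05, 0.95], centers, radius))  # idle CPU but memory nearly exhausted
```

In this sketch the second state plays the role of an injected fault: it lies far from both normal activity clusters, so the comparison step flags it. A real deployment would need more features, feature scaling, and a threshold tuned on fault-injection runs.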