Joshua Thompson, D. Dreisigmeyer, T. Jones, M. Kirby, Joshua Ladd
{"title":"Accurate fault prediction of BlueGene/P RAS logs via geometric reduction","authors":"Joshua Thompson, D. Dreisigmeyer, T. Jones, M. Kirby, Joshua Ladd","doi":"10.1109/DSNW.2010.5542626","DOIUrl":null,"url":null,"abstract":"This investigation presents two distinct and novel approaches for the prediction of system failures occurring in Oak Ridge National Laboratory's Blue Gene/P supercomputer. Each technique uses raw numeric and textual subsets of large data logs of physical system information such as fan speeds and CPU temperatures. This data is used to develop models of the system capable of sensing anomalies, or deviations from nominal behavior. Each algorithm predicted event log reported anomalies in advance of their occurrence and one algorithm did so without false positives. Both algorithms predicted an anomaly that did not appear in the event log. It was later learned that the fault missing from the log but predicted by both algorithms was confirmed to have occurred by the system administrator.","PeriodicalId":124206,"journal":{"name":"2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSNW.2010.5542626","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14
Abstract
This investigation presents two distinct and novel approaches for the prediction of system failures occurring in Oak Ridge National Laboratory's Blue Gene/P supercomputer. Each technique uses raw numeric and textual subsets of large data logs of physical system information such as fan speeds and CPU temperatures. This data is used to develop models of the system capable of sensing anomalies, or deviations from nominal behavior. Each algorithm predicted event log reported anomalies in advance of their occurrence and one algorithm did so without false positives. Both algorithms predicted an anomaly that did not appear in the event log. It was later learned that the fault missing from the log but predicted by both algorithms was confirmed to have occurred by the system administrator.