Online failure prediction for HPC resources using decentralized clustering

2014 21st International Conference on High Performance Computing (HiPC). Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116903

Alejandro Pelaez, Andres Quiroz, J. Browne, Edward Chuah, M. Parashar


Citations: 16

Abstract

Ensuring high reliability of large-scale clusters is becoming more critical as these machines continue to grow, since growth increases the complexity and number of interactions between nodes and thus the frequency of failures. For this reason, predicting node failures so that errors can be prevented before they occur has become extremely valuable. A common approach to failure prediction is to analyze traces of system events to find correlations between event types or anomalous event patterns and node failures, and then to use the identified types or patterns as failure predictors at run-time. However, typical centralized solutions of this kind suffer from high transmission and processing overheads at very large scales. We present a solution to the problem of predicting compute node soft-lockups in large-scale clusters by using a decentralized online clustering algorithm (DOC) to detect anomalies in resource usage logs, which have been shown to correlate with particular types of node failures in supercomputer clusters. We demonstrate the effectiveness of this system using the monitoring logs from the Ranger supercomputer at the Texas Advanced Computing Center. Experiments show that this approach achieves accuracy similar to that of related approaches while maintaining low RAM and bandwidth usage, with a runtime impact on currently running applications of less than 2%.
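To make the idea of clustering-based anomaly detection on resource usage logs concrete, the following is a minimal single-process sketch. It is not the DOC algorithm from the paper: it assumes a simple grid-density rule in which a sample is anomalous when few other samples fall in the same cell. The record layout (node id, CPU utilization, memory utilization), the function names, and the `bins` and `min_count` parameters are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical record: (node id, CPU utilization in [0, 1], memory utilization in [0, 1])
Sample = Tuple[str, float, float]


def grid_cell(cpu: float, mem: float, bins: int) -> Tuple[int, int]:
    """Map a normalized (cpu, mem) point to a cell of a bins x bins grid."""
    def index(value: float) -> int:
        clamped = min(max(value, 0.0), 1.0)
        # A value of exactly 1.0 belongs to the last cell, not to cell `bins`.
        return min(int(clamped * bins), bins - 1)

    return (index(cpu), index(mem))


def flag_anomalies(samples: List[Sample], bins: int = 4, min_count: int = 2) -> List[str]:
    """Return the ids of nodes whose samples lie in sparsely populated cells.

    Assumption for this sketch: dense cells represent normal behaviour, and a
    cell holding fewer than `min_count` samples is treated as anomalous.
    """
    counts = Counter(grid_cell(cpu, mem, bins) for _, cpu, mem in samples)
    return [
        node
        for node, cpu, mem in samples
        if counts[grid_cell(cpu, mem, bins)] < min_count
    ]


if __name__ == "__main__":
    usage = [
        ("n1", 0.50, 0.40),
        ("n2", 0.55, 0.45),
        ("n3", 0.52, 0.42),
        ("n4", 0.95, 0.98),  # far from the other nodes in both dimensions
    ]
    print(flag_anomalies(usage))
```

In a decentralized setting the cell counts would be maintained across nodes rather than in one process; this sketch only illustrates the density test itself.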
