Automatic fault characterization via abnormality-enhanced classification

IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012); Pub Date: 2012-06-25; DOI: 10.1109/DSN.2012.6263926

G. Bronevetsky, I. Laguna, B. Supinski, S. Bagchi


Citations: 43

Abstract

Enterprise and high-performance computing systems are growing extremely large and complex, employing many processors and diverse software/hardware stacks. As these machines grow in scale, faults become more frequent, and system complexity makes them difficult to detect and diagnose. The difficulty is particularly acute for faults that degrade system performance or cause erratic behavior but do not cause outright crashes. The cost of these errors is high since they significantly reduce system productivity, both initially and through the time required to resolve them. Current system management techniques do not work well since they require manual examination of system behavior and do not identify root causes. When a fault manifests, system administrators need timely notification about the type of fault, the time period in which it occurred, and the processor on which it originated. Statistical modeling approaches can accurately characterize normal and abnormal system behavior. However, the complex effects of system faults are less amenable to these techniques. This paper demonstrates that the complexity of system faults makes traditional classification and clustering algorithms inadequate for characterizing them. We design novel techniques that combine classification algorithms with information on the abnormality of application behavior to improve detection and characterization accuracy significantly. Our experiments demonstrate that our techniques can detect and characterize faults with 85% accuracy, compared to just 12% accuracy for direct applications of traditional techniques.
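The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea of pairing a classifier with abnormality information. It is not the authors' method. It assumes per-feature z-scores against a fault-free baseline as the abnormality measure and a nearest-centroid rule as the classifier; the feature layout (CPU seconds, memory in MB), the fault labels, the threshold of 3.0, and every function name are hypothetical.

```python
from statistics import mean, stdev

def fit_baseline(normal_rows):
    """Per-feature (mean, standard deviation) of fault-free observations."""
    return [(mean(col), stdev(col)) for col in zip(*normal_rows)]

def abnormality(row, baseline):
    """Per-feature z-score magnitude: distance of the row from normal behavior."""
    return [abs(x - mu) / sd if sd > 0 else 0.0
            for x, (mu, sd) in zip(row, baseline)]

def fit_centroids(labeled_rows, baseline):
    """Average abnormality vector for each fault label."""
    grouped = {}
    for label, row in labeled_rows:
        grouped.setdefault(label, []).append(abnormality(row, baseline))
    return {label: [mean(col) for col in zip(*vectors)]
            for label, vectors in grouped.items()}

def classify(row, baseline, centroids, threshold=3.0):
    """Report 'normal' when no feature is abnormal, else the nearest fault centroid."""
    scores = abnormality(row, baseline)
    if max(scores) < threshold:
        return "normal"

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(scores, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Hypothetical observations: (cpu_seconds, memory_mb) per time window.
NORMAL = [(1.0, 100.0), (1.1, 102.0), (0.9, 98.0), (1.0, 101.0), (1.0, 99.0)]
FAULTY = [
    ("cpu_hog", (2.0, 100.0)),
    ("cpu_hog", (2.2, 101.0)),
    ("mem_leak", (1.0, 150.0)),
    ("mem_leak", (1.1, 160.0)),
]

BASELINE = fit_baseline(NORMAL)
CENTROIDS = fit_centroids(FAULTY, BASELINE)

for sample in [(1.05, 100.5), (2.1, 99.0), (1.0, 155.0)]:
    print(sample, "->", classify(sample, BASELINE, CENTROIDS))
```

Classifying on abnormality scores rather than raw measurements means the classifier sees how far each feature deviates from normal behavior, which is one simple way to read "abnormality-enhanced" classification. The paper itself may define and use abnormality quite differently.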

