Time Machine: Generative Real-Time Model for Failure (and Lead Time) Prediction in HPC Systems

2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). Pub Date: 2023-06-01. DOI: 10.1109/DSN58367.2023.00054

Khalid Ayedh Alharthi, A. Jhumka, S. Di, Lin Gui, F. Cappello, Simon McIntosh-Smith

Citations: 0

Abstract

High Performance Computing (HPC) systems generate a large amount of unstructured/alphanumeric log messages that capture the health state of their components. Due to their design complexity, HPC systems often undergo failures that halt the execution of applications (e.g., weather prediction, aerodynamics simulation). However, existing failure prediction methods, which typically seek to extract information-theoretic features, fail to scale in terms of both accuracy and prediction speed, limiting their adoption in real-time production systems. In this paper, in contrast to existing work and inspired by the transformer-based neural networks that have revolutionized sequential learning in natural language processing (NLP) tasks, we propose a novel, scalable, log-based, self-supervised model (i.e., one that needs no manual labels) called Time Machine. (A time machine allows us to travel into the future to observe the health state of an HPC system and report back; here, we travel into the log extension to report an upcoming failure.) Time Machine predicts (i) forthcoming log events, (ii) the upcoming failure and its location, and (iii) the expected lead time to failure. Time Machine is designed by combining two stacks of transformer decoders, each employing the self-attention mechanism. The first stack addresses the failure location by predicting the sequence of log events and then identifying whether a failure event is part of that sequence. The second stack addresses the lead time to the predicted failure. We evaluate Time Machine on four real-world HPC log datasets and compare it against three state-of-the-art failure prediction approaches. Results show that Time Machine significantly outperforms the related work on BLEU, ROUGE, MCC, and F1-score in predicting forthcoming events, failure location, and failure lead time, with higher prediction speed.
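To make the failure-identification and scoring steps concrete, the following minimal Python sketch shows one way to check whether a predicted sequence of log events contains a failure event, and how MCC and F1-score can be computed from the resulting flags. It is an illustration under stated assumptions, not the authors' implementation: the event IDs (`E1`, `E_FAIL`), the `FAILURE_EVENTS` set, and all function names are hypothetical, and the predicted sequences are hand-written stand-ins for the output of the first transformer-decoder stack.

```python
from math import sqrt
from typing import Optional, Sequence, Tuple

# Hypothetical set of event IDs that count as failures (assumption;
# the real set depends on the log dataset and its parsing).
FAILURE_EVENTS = {"E_FAIL"}


def first_failure_index(predicted_events: Sequence[str]) -> Optional[int]:
    """Return the position of the first failure event in a predicted
    sequence, or None if the sequence contains no failure event."""
    for i, event in enumerate(predicted_events):
        if event in FAILURE_EVENTS:
            return i
    return None


def confusion_counts(y_true: Sequence[bool],
                     y_pred: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Count (tp, tn, fp, fn) for binary failure / no-failure labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, tn, fp, fn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0.0 when the denominator is 0."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Stand-in predicted sequences (assumption: produced by some generative model).
predicted = [
    ["E1", "E2", "E_FAIL"],
    ["E1", "E3"],
    ["E_FAIL", "E2"],
    ["E2", "E2"],
]
# Hand-written ground truth: did a failure actually follow each window?
actual = [True, False, False, True]

# A window is flagged as failing if its predicted sequence contains a failure event.
flags = [first_failure_index(seq) is not None for seq in predicted]
tp, tn, fp, fn = confusion_counts(actual, flags)
print(flags, f1_score(tp, fp, fn), mcc(tp, tn, fp, fn))
```

MCC uses all four cells of the confusion matrix, which makes it a useful complement to F1-score when failure events are rare relative to normal log events.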
