BlueGene/L Failure Analysis and Prediction Models

Yinglung Liang, Yanyong Zhang, A. Sivasubramaniam, M. Jette, R. Sahoo
Published in: International Conference on Dependable Systems and Networks (DSN'06), 2006-06-25
DOI: 10.1109/DSN.2006.18 (https://doi.org/10.1109/DSN.2006.18)
Citations: 292

Abstract

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. One of the challenges in designing and deploying these systems in a production setting is the need to account for failures, whether in hardware or in software. Earlier work has shown that conventional runtime fault-tolerance techniques such as periodic checkpointing are not effective for these emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we collected RAS (reliability, availability, and serviceability) event logs from BlueGene/L over a period of more than 100 days. We investigated the characteristics of fatal failure events, as well as the correlation between fatal and non-fatal events. Based on these observations, we developed three simple yet effective failure prediction methods, which can predict around 80% of the memory and network failures and 47% of the application I/O failures.
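
The abstract does not spell out the three prediction methods, so the following is only an illustrative sketch of the general idea of using non-fatal events as precursors of fatal ones. It is not the authors' method. The function name, the time-window rule, and the `min_precursors` threshold are all assumptions made for this example. The function measures what fraction of fatal events would have been flagged if a fatal event counts as "predicted" whenever enough non-fatal events were logged shortly before it.

```python
from bisect import bisect_left


def fraction_of_fatal_events_flagged(nonfatal_times, fatal_times,
                                     window, min_precursors=1):
    """Return the fraction of fatal events preceded by at least
    `min_precursors` non-fatal events in the interval [t - window, t).

    Hypothetical helper for illustration only; timestamps are plain
    numbers (e.g. seconds since the start of the log).
    """
    if not fatal_times:
        return 0.0

    nonfatal_sorted = sorted(nonfatal_times)
    flagged = 0
    for t in fatal_times:
        # Index of the first non-fatal event at or after the window start.
        lo = bisect_left(nonfatal_sorted, t - window)
        # Index of the first non-fatal event at or after t (excluded).
        hi = bisect_left(nonfatal_sorted, t)
        if hi - lo >= min_precursors:
            flagged += 1
    return flagged / len(fatal_times)


if __name__ == "__main__":
    # Made-up timestamps, not data from the study.
    nonfatal = [10, 50, 55, 200]
    fatal = [60, 120, 205]
    print(fraction_of_fatal_events_flagged(nonfatal, fatal, window=20))
```

A measure of this kind corresponds to the "percentage of failures predicted" figures quoted in the abstract. A fuller evaluation would also count false alarms, which this sketch ignores.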