Evaluating Performance Impacts of Delayed Failure Repairing on Large-Scale Systems

Zhou Zhou, Wei Tang, Ziming Zheng, Z. Lan, N. Desai
2011 IEEE International Conference on Cluster Computing. Published 2011-09-26. DOI: 10.1109/CLUSTER.2011.71 (https://doi.org/10.1109/CLUSTER.2011.71)
Citation count: 2

Abstract

With rapid advances in technology, we are now moving toward exascale computing. Many experts predict that exascale computers will have millions of nodes, billions of threads of execution, hundreds of petabytes of main memory, and exabytes of persistent storage. For systems of this scale, frequent failures are becoming a serious concern. A major reason is that failures are hard to detect in a large-scale system, so failure repair may take substantial time. In this paper, we investigate the effect of delayed repair on two popular types of high-performance computing systems: IBM Blue Gene/P and general-purpose clusters. We analyze how delayed failure repair affects job performance when some computing units have failed but are not fixed in time. Our study is based on real workload traces and RAS logs collected from production supercomputing systems. Our trace-based simulations indicate that fast failure detection and recovery are essential for moving toward petascale computing and beyond.
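As a rough illustration of why repair delay matters, the sketch below estimates how much node capacity is lost when failed nodes stay down for a fixed repair delay. It is a toy model under stated assumptions, not the trace-based simulator used in the paper: the function names, the `(time, node_id)` failure format, and the rule that a new failure on an already-down node restarts the repair clock are all hypothetical.

```python
def unavailable_node_hours(failures, repair_delay, horizon):
    """Node-hours lost in [0, horizon] for a fixed repair delay.

    failures: list of (time_hours, node_id) events (assumed format).
    Assumption: a failure on a node that is already down restarts
    the repair clock, so overlapping outages are merged per node.
    """
    down = {}  # node_id -> list of [start, end] outage intervals
    for t, node in sorted(failures):
        if t >= horizon:
            continue  # failure falls outside the observation window
        end = min(t + repair_delay, horizon)
        intervals = down.setdefault(node, [])
        if intervals and t <= intervals[-1][1]:
            # Node is still down: extend the current outage.
            intervals[-1][1] = max(intervals[-1][1], end)
        else:
            intervals.append([t, end])
    return sum(e - s for ivs in down.values() for s, e in ivs)


def utilization_ceiling(total_nodes, failures, repair_delay, horizon):
    """Upper bound on achievable utilization given the lost capacity."""
    lost = unavailable_node_hours(failures, repair_delay, horizon)
    return 1.0 - lost / (total_nodes * horizon)


if __name__ == "__main__":
    # Hypothetical failure events on a 4-node system over 100 hours.
    events = [(10, "n1"), (12, "n1"), (50, "n2")]
    for delay in (1, 5, 60):
        print(delay, utilization_ceiling(4, events, delay, 100))
```

The model ignores scheduling effects such as partition fragmentation or jobs killed by a failure, which a trace-based study would capture; it only shows that the capacity ceiling drops as the repair delay grows.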