Evaluating Performance Impacts of Delayed Failure Repairing on Large-Scale Systems
Zhou Zhou, Wei Tang, Ziming Zheng, Z. Lan, N. Desai
2011 IEEE International Conference on Cluster Computing, published 2011-09-26
DOI: 10.1109/CLUSTER.2011.71 (https://doi.org/10.1109/CLUSTER.2011.71)
Citations: 2
Abstract
With rapid advances in technology, we are now moving toward exascale computing. Many experts predict that exascale computers will have millions of nodes, billions of threads of execution, hundreds of petabytes of internal memory, and exabytes of persistent storage. For systems of such a scale, frequent failures are becoming a serious concern. One of the most important reasons is that failures are hard to detect in a large-scale system, so failure repair may take substantial time. In this paper, we investigate the effect of delayed repair on two popular types of high-performance computing systems: IBM Blue Gene/P and general clusters. We analyze how delayed failure repair affects job performance when some computing units are at fault but not fixed in time. Our study is based on real workload traces and RAS logs collected from production supercomputing systems. Our trace-based simulations indicate that fast failure detection and recovery are essential for moving toward petascale computing and beyond.