Combining Partial Redundancy and Checkpointing for HPC

James Elliott, Kishor Kharbas, David Fiala, F. Mueller, Kurt B. Ferreira, C. Engelmann
DOI: 10.1109/ICDCS.2012.56
Published in: 2012 IEEE 32nd International Conference on Distributed Computing Systems, pp. 615-626
Publication date: 2012-06-18
Cited by: 142

Abstract

Today's largest High Performance Computing (HPC) systems exceed one petaflop/s (10^15 floating-point operations per second), and exascale systems are projected within seven years. But reliability is becoming one of the major challenges faced by exascale computing. With billion-core parallelism, the mean time to failure is projected to be in the range of minutes or hours instead of days. Failures are becoming the norm rather than the exception during execution of HPC applications. Current fault tolerance techniques in HPC focus on reactive ways to mitigate faults, namely via checkpoint and restart (C/R). Apart from storage overheads, C/R-based fault recovery comes at an additional cost in terms of application performance because normal execution is disrupted when checkpoints are taken. Studies have shown that applications running at a large scale spend more than 50% of their total time saving checkpoints, restarting, and redoing lost work. Redundancy is another fault tolerance technique, which employs redundant processes performing the same task. If a process fails, a replica of it can take over its execution. Thus, redundant copies can decrease the overall failure rate. The downside of redundancy is that extra resources are required and there is additional overhead on communication and synchronization. This work contributes a model and analyzes the benefit of C/R in coordination with redundancy at different degrees to minimize the total wallclock time and resource utilization of HPC applications. We further conduct experiments with an implementation of redundancy within the MPI layer on a cluster. Our experimental results confirm the benefit of dual and triple redundancy (but not of partial redundancy) and show a close fit to the model. At ≈ 80,000 processes, dual redundancy requires twice the number of processing resources for an application but allows two jobs of 128 hours of wallclock time to finish within the time of just one job without redundancy. For narrow ranges of processor counts, partial redundancy results in the lowest time. Once the count exceeds ≈ 770,000, triple redundancy has the lowest overall cost. Thus, redundancy allows one to trade off additional resource requirements against wallclock time, which provides a tuning knob for users to adapt to resource availabilities.
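The trade-off the abstract describes can be illustrated with a small model. The paper's own analytical model is not reproduced here; as a hedged sketch, the snippet below combines Daly's classic first-order checkpoint-interval formula and completion-time model for C/R with a Monte Carlo estimate of how r-way process replication stretches the effective system MTBF (a logical rank is assumed lost only once all of its replicas have failed, with independent exponential node failures). All parameter values are illustrative, not from the paper.

```python
import math
import random

def daly_opt_interval(mtbf, ckpt_cost):
    """First-order optimal checkpoint interval (Daly):
    tau ~ sqrt(2 * delta * M) - delta, with M the system MTBF
    and delta the cost of writing one checkpoint."""
    return math.sqrt(2.0 * ckpt_cost * mtbf) - ckpt_cost

def expected_walltime(solve_time, mtbf, ckpt_cost, restart_cost, tau):
    """Daly's expected completion time under exponential failures:
    T = M * e^(R/M) * (e^((tau+delta)/M) - 1) * T_s / tau."""
    return (mtbf * math.exp(restart_cost / mtbf)
            * (math.exp((tau + ckpt_cost) / mtbf) - 1.0)
            * solve_time / tau)

def replicated_mtbf(n_ranks, degree, node_mtbf, trials=2000, seed=1):
    """Monte Carlo estimate of the effective MTBF with `degree`-way
    replication: draw an exponential failure time for every replica;
    a rank dies when its last replica dies (max over replicas), and
    the job fails at the earliest rank death (min over ranks)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        fail = [[rng.expovariate(1.0 / node_mtbf) for _ in range(degree)]
                for _ in range(n_ranks)]
        total += min(max(times) for times in fail)
    return total / trials
```

With degree = 1 the Monte Carlo estimate reduces to node_mtbf / n_ranks, while dual redundancy yields a far larger effective MTBF (a failure only matters once its partner replica has also died), which in turn shrinks the C/R overhead term at the cost of doubling the resources, mirroring the knob the abstract describes.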