Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications

2014 IEEE 28th International Parallel and Distributed Processing Symposium | Pub Date: 2014-05-19 | DOI: 10.1109/IPDPS.2014.122

S. Di, M. Bouguerra, L. Bautista-Gomez, F. Cappello


Citations: 92

Abstract

The HPC community projects that future extreme-scale systems will be much less stable than current petascale systems, and will therefore require sophisticated fault tolerance to guarantee the completion of large-scale numerical computations. Execution failures can arise from factors at different scales, from transient uncorrectable memory errors localized in individual processes to massive system outages. Multi-level checkpoint/restart is a promising model that provides an elastic response to different types of failures. It stores checkpoints at different levels, e.g., local memory, remote memory, software RAID, local SSD, and a remote file system. In this paper, we address two open questions: 1) how to optimize the selection of checkpoint levels based on the failure distributions observed in a system, and 2) how to compute the optimal checkpoint interval for each of these levels. The contribution is three-fold. (1) We build a mathematical model that fits the multi-level checkpoint/restart mechanism to large-scale applications subject to various types of failures. (2) We theoretically optimize the overall execution performance of each parallel application by selecting the best combination of checkpoint levels and the corresponding checkpoint intervals. (3) We characterize checkpoint overheads at different checkpoint levels in a real cluster environment, and evaluate our optimal solutions using both simulation with millions of cores and a real environment with real-world MPI programs running on hundreds of cores. Experiments show that optimized selections of levels, combined with optimal checkpoint intervals at each level, outperform other state-of-the-art solutions by 5-50 percent.
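As background for the second question (how long to wait between checkpoints), the sketch below uses the classic single-level first-order approximation usually attributed to Young, in which the interval is sqrt(2 x checkpoint cost x mean time between failures). It is not the multi-level model of this paper, which optimizes level selection and intervals jointly; here each level is treated independently purely for illustration. The level names, checkpoint costs, and MTBF values are hypothetical and are not taken from the paper.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpoint interval: sqrt(2 * C * MTBF).

    Single-level approximation only; NOT the paper's multi-level model.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def waste_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order fraction of time lost: checkpoint overhead plus expected rework."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


# Hypothetical levels: (checkpoint cost in seconds, MTBF in seconds of the
# failures that this level is assumed to recover from). Illustrative values only.
HYPOTHETICAL_LEVELS = {
    "local_ssd": (2.0, 10_000.0),
    "remote_file_system": (60.0, 500_000.0),
}

if __name__ == "__main__":
    for name, (cost, mtbf) in HYPOTHETICAL_LEVELS.items():
        interval = young_interval(cost, mtbf)
        waste = waste_fraction(interval, cost, mtbf)
        print(f"{name}: interval = {interval:.1f} s, waste = {waste:.3%}")
```

Cheaper levels get shorter intervals under this approximation, which is the intuition behind checkpointing frequently to fast local storage and rarely to the remote file system. Treating the levels independently ignores their interaction, which is the gap the paper's joint optimization is meant to address.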

