Optimization of cloud task processing with checkpoint-restart mechanism

2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
Pub Date: 2013-11-17
DOI: 10.1145/2503210.2503217

S. Di, Y. Robert, F. Vivien, Derrick Kondo, Cho-Li Wang, F. Cappello


Citations: 80

Abstract

In this paper, we aim to optimize fault-tolerance techniques based on a checkpoint/restart mechanism in the context of cloud computing. Our contribution is threefold. (1) We derive a new formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is generic, making no assumption about the failure probability distribution, and is simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing with respect to various costs, such as checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and the Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced in a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock lengths by 50-100 seconds per job on average.
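The abstract names Young's formula as the baseline but states neither that formula nor the authors' own. As background only, the sketch below computes Young's classical first-order approximation of the optimal checkpoint interval, sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint and MTBF is the mean time between failures. The function names, the helper that converts an interval into a checkpoint count, and the sample values are illustrative assumptions; this is not the paper's optimized formula or its adaptive algorithm.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds.
    mtbf_s: mean time between failures, in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_count(task_length_s: float, interval_s: float) -> int:
    """Checkpoints taken when one is written after every interval_s of work.

    Hypothetical helper: assumes no checkpoint is written when the task ends.
    """
    if task_length_s <= 0 or interval_s <= 0:
        raise ValueError("task length and interval must be positive")
    segments = max(1, math.ceil(task_length_s / interval_s))
    return segments - 1


if __name__ == "__main__":
    # Made-up sample values, not taken from the paper or the Google trace.
    interval = young_interval(checkpoint_cost_s=2.0, mtbf_s=3600.0)
    count = checkpoint_count(task_length_s=1800.0, interval_s=interval)
    print(f"interval = {interval:.1f} s, checkpoints = {count}")
```

Young's approximation takes a mean time between failures as input, whereas the abstract describes the authors' analysis as making no assumption about the failure probability distribution.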

