通过考虑单个作业失败概率来改进检查点间隔

Alvaro Frank, Manuel Baumgartner, Reza Salkhordeh, A. Brinkmann
{"title":"通过考虑单个作业失败概率来改进检查点间隔","authors":"Alvaro Frank, Manuel Baumgartner, Reza Salkhordeh, A. Brinkmann","doi":"10.1109/IPDPS49936.2021.00038","DOIUrl":null,"url":null,"abstract":"Checkpointing is a popular resilience method in HPC and its efficiency highly depends on the choice of the checkpoint interval. Standard analytical approaches optimize intervals for big, long-running jobs that fail with high probability, while they are unable to minimize checkpointing overheads for jobs with a low or medium probability of failing. Nevertheless, our analysis of batch traces of four HPC systems shows that these jobs are extremely common.We therefore propose an iterative checkpointing algorithm to compute efficient intervals for jobs with a medium risk of failure. The method also supports big and long-running jobs by converging to the results of various traditional methods for these. We validated our algorithm using batch system simulations including traces from four HPC systems and compared it to five alternative checkpoint methods. The evaluations show up to 40% checkpoint savings for individual jobs when using our method, while improving checkpointing costs of complete HPC systems between 2.8% and 24.4% compared to the best alternative approach.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Improving checkpointing intervals by considering individual job failure probabilities\",\"authors\":\"Alvaro Frank, Manuel Baumgartner, Reza Salkhordeh, A. Brinkmann\",\"doi\":\"10.1109/IPDPS49936.2021.00038\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Checkpointing is a popular resilience method in HPC and its efficiency highly depends on the choice of the checkpoint interval. Standard analytical approaches optimize intervals for big, long-running jobs that fail with high probability, while they are unable to minimize checkpointing overheads for jobs with a low or medium probability of failing. Nevertheless, our analysis of batch traces of four HPC systems shows that these jobs are extremely common.We therefore propose an iterative checkpointing algorithm to compute efficient intervals for jobs with a medium risk of failure. The method also supports big and long-running jobs by converging to the results of various traditional methods for these. We validated our algorithm using batch system simulations including traces from four HPC systems and compared it to five alternative checkpoint methods. The evaluations show up to 40% checkpoint savings for individual jobs when using our method, while improving checkpointing costs of complete HPC systems between 2.8% and 24.4% compared to the best alternative approach.\",\"PeriodicalId\":372234,\"journal\":{\"name\":\"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"volume\":\"10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPS49936.2021.00038\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS49936.2021.00038","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

摘要

检查点是高性能计算中常用的一种弹性方法,其效率很大程度上取决于检查点间隔的选择。标准的分析方法为大的、长时间运行的、高概率失败的作业优化时间间隔,而对于低概率或中等概率失败的作业,它们无法最小化检查点开销。然而,我们对四个高性能计算系统的批量跟踪分析表明,这些作业非常常见。因此,我们提出了一种迭代检查点算法来计算具有中等失败风险的作业的有效间隔。该方法还通过收敛各种传统方法的结果来支持大型和长时间运行的作业。我们使用批处理系统模拟验证了我们的算法,包括来自四个高性能计算系统的跟踪,并将其与五种替代检查点方法进行了比较。评估显示,与最佳替代方法相比,使用我们的方法可为单个作业节省高达40%的检查点,同时将完整HPC系统的检查点成本提高2.8%至24.4%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Improving checkpointing intervals by considering individual job failure probabilities
Checkpointing is a popular resilience method in HPC and its efficiency highly depends on the choice of the checkpoint interval. Standard analytical approaches optimize intervals for big, long-running jobs that fail with high probability, while they are unable to minimize checkpointing overheads for jobs with a low or medium probability of failing. Nevertheless, our analysis of batch traces of four HPC systems shows that these jobs are extremely common.We therefore propose an iterative checkpointing algorithm to compute efficient intervals for jobs with a medium risk of failure. The method also supports big and long-running jobs by converging to the results of various traditional methods for these. We validated our algorithm using batch system simulations including traces from four HPC systems and compared it to five alternative checkpoint methods. The evaluations show up to 40% checkpoint savings for individual jobs when using our method, while improving checkpointing costs of complete HPC systems between 2.8% and 24.4% compared to the best alternative approach.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信