Combining XOR and Partner Checkpointing for Resilient Multilevel Checkpoint/Restart

Masoud Gholami, F. Schintke
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), published 2021-05-01
DOI: 10.1109/IPDPS49936.2021.00036 (https://doi.org/10.1109/IPDPS49936.2021.00036)

Abstract

Checkpoint/restart (C/R) makes large-scale parallel jobs resilient against multiple node failures but typically takes considerable time and storage space. Efficient C/R strategies try to gain high levels of fault-tolerance while keeping the involved I/O and computation low. By combining XOR and partner checkpointing, two relatively weak C/R strategies, we develop and evaluate a stable, scalable, and fast C/R approach (including initialization, checkpointing, version consensus, and recovery mechanisms) that outperforms other C/R methods such as Reed-Solomon checkpointing in terms of stability and performance.
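The abstract names XOR and partner checkpointing but does not describe how the paper combines them. As a rough illustration of the two building blocks only, the following Python sketch shows how each scheme can rebuild one lost checkpoint. It is not the authors' algorithm: the function names, the ring-style partner assignment, and the toy checkpoint data are all assumptions made for this example.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the bytewise XOR of equal-length checkpoint blocks."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_with_xor(surviving_blocks, parity):
    """Rebuild the single missing block of an XOR group.

    XOR-ing the parity with every surviving block cancels them out and
    leaves the missing block. This tolerates only one loss per group.
    """
    return xor_parity(list(surviving_blocks) + [parity])


def partner_copies(blocks):
    """Assumed ring layout: rank i keeps a full copy of rank (i - 1)'s block."""
    n = len(blocks)
    return {i: blocks[(i - 1) % n] for i in range(n)}


# Hypothetical group of four ranks with toy checkpoint data.
checkpoints = [b"rank0___", b"rank1___", b"rank2___", b"rank3___"]

# XOR scheme: rank 2 fails, the other three blocks plus parity rebuild it.
parity = xor_parity(checkpoints)
survivors = [c for i, c in enumerate(checkpoints) if i != 2]
restored_xor = recover_with_xor(survivors, parity)

# Partner scheme: rank 2 fails, its partner (rank 3) still holds a copy.
copies = partner_copies(checkpoints)
restored_partner = copies[3]
```

In this sketch the trade-off is visible in the storage cost: the XOR group stores one extra block for the whole group but survives only a single loss, whereas the partner scheme stores one extra block per rank and fails only when a rank and its partner are lost together.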