Transparent High-Speed Network Checkpoint/Restart in MPI

Julien Adam, Jean-Baptiste Besnard, A. Malony, S. Shende, Marc Pérache, Patrick Carribault, Julien Jaeger
{"title":"Transparent High-Speed Network Checkpoint/Restart in MPI","authors":"Julien Adam, Jean-Baptiste Besnard, A. Malony, S. Shende, Marc Pérache, Patrick Carribault, Julien Jaeger","doi":"10.1145/3236367.3236383","DOIUrl":null,"url":null,"abstract":"Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable transparent checkpointing mechanism. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart (C/R) and ignores wider features such as resiliency. We show how existing transparent checkpointing methods can be practically applied to MPI implementations given a sufficient collaboration from the MPI runtime. Our C/R technique is then measured on MPI benchmarks such as IMB and Lulesh relying on Infiniband high-speed network, demonstrating that the chosen approach is sufficiently general and that performance is mostly preserved. We argue that enabling fault-tolerance without any modification inside target MPI applications is possible, and show how it could be the first step for more integrated resiliency combined with failure mitigation like ULFM.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"123 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 25th European MPI Users' Group Meeting","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3236367.3236383","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable transparent checkpointing mechanism. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart (C/R) and ignores wider features such as resiliency. We show how existing transparent checkpointing methods can be practically applied to MPI implementations given a sufficient collaboration from the MPI runtime. Our C/R technique is then measured on MPI benchmarks such as IMB and Lulesh relying on Infiniband high-speed network, demonstrating that the chosen approach is sufficiently general and that performance is mostly preserved. We argue that enabling fault-tolerance without any modification inside target MPI applications is possible, and show how it could be the first step for more integrated resiliency combined with failure mitigation like ULFM.
透明高速网络检查点/重启在MPI
当涉及到大规模运行并行程序时,容错一直是一个重要的主题。据统计,硬件和软件故障预计会更频繁地发生在拥有数百万计算单元的系统上。此外,作业越大,崩溃所浪费的计算时间就越多。在本文中,我们描述了在我们的MPI运行时中完成的工作,以启用透明的检查点机制。与MPI 4.0用户级故障缓解(ULFM)接口不同,我们的工作仅针对检查点/重新启动(C/R),而忽略了更广泛的功能,如弹性。我们将展示现有的透明检查点方法如何在MPI运行时提供充分的协作的情况下实际应用于MPI实现。然后,我们的C/R技术在MPI基准(如IMB和Lulesh)上进行了测量,这些基准依赖于Infiniband高速网络,表明所选择的方法足够通用,并且性能基本保持不变。我们认为,在目标MPI应用程序内部无需任何修改即可实现容错是可能的,并展示了它如何成为实现更集成的弹性与ULFM等故障缓解相结合的第一步。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信