Transparent High-Speed Network Checkpoint/Restart in MPI

Proceedings of the 25th European MPI Users' Group Meeting. Pub Date: 2018-09-23. DOI: 10.1145/3236367.3236383

Julien Adam, Jean-Baptiste Besnard, A. Malony, S. Shende, Marc Pérache, Patrick Carribault, Julien Jaeger


Citations: 7

Abstract

Fault tolerance has always been an important concern when running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems that gather millions of computing units. Moreover, the larger a job is, the more computing hours a crash wastes. In this paper, we describe the work done in our MPI runtime to enable a transparent checkpointing mechanism. Unlike the User-Level Failure Mitigation (ULFM) interface proposed for MPI 4.0, our work targets solely Checkpoint/Restart (C/R) and leaves aside wider features such as resiliency. We show how existing transparent checkpointing methods can be practically applied to MPI implementations given sufficient collaboration from the MPI runtime. Our C/R technique is then measured on MPI benchmarks such as IMB and LULESH running over an InfiniBand high-speed network, demonstrating that the chosen approach is sufficiently general and that performance is mostly preserved. We argue that enabling fault tolerance without any modification to the target MPI applications is possible, and show how it could be a first step toward more integrated resiliency combined with failure mitigation mechanisms such as ULFM.

