Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach

arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2024-08-05 DOI:arxiv-2408.02218

Yao Xu, Gene Cooperman


Abstract

MPI is the de facto standard for parallel computing on a cluster of computers. Checkpointing is an important component in any strategy for software resilience and for long-running jobs that must be executed by chaining together time-bounded resource allocations. This work solves an old problem: a practical and general algorithm for transparent checkpointing of MPI that is both efficient and compatible with most of the latest network software. Transparent checkpointing is attractive due to its generality and ease of use for most MPI application developers. Earlier efforts at transparent checkpointing for MPI, a decade ago, suffered from two difficult problems: (i) reliance on a specific MPI implementation tied to a specific network technology; and (ii) failure to demonstrate sufficiently low runtime overhead. Problem (i) (network dependence) was already solved in 2019 by MANA's introduction of split processes. Problem (ii) (runtime overhead) is solved in this work. This paper introduces an approach that avoids these limitations, employing a novel topological sort to algorithmically determine a safe future synchronization point. The algorithm is valid for both blocking and non-blocking collective communication in MPI. We demonstrate the efficacy and scalability of our approach through both micro-benchmarks and a set of five real-world MPI applications, notably including the widely used VASP (Vienna Ab Initio Simulation Package), which is responsible for 11% of the workload on the Perlmutter supercomputer at Lawrence Berkeley National Laboratory. VASP was previously cited as a special challenge for checkpointing, in part due to its multi-algorithm codes.
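The abstract does not spell out how the topological sort is constructed, so the following is only a generic illustration of topological sorting (Kahn's algorithm), not the paper's algorithm. The operation names and the ordering constraints between them are hypothetical, chosen purely to show how a valid order is derived from "must happen before" edges.

```python
from collections import deque


def topological_order(nodes, edges):
    """Kahn's algorithm: order nodes so every (before, after) edge is respected."""
    successors = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for before, after in edges:
        successors[before].append(after)
        indegree[after] += 1

    # Start from nodes that have no unmet predecessors.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)

    if len(order) != len(nodes):
        raise ValueError("cycle detected: no valid ordering exists")
    return order


# Hypothetical labels for three collective operations and their constraints.
ops = ["bcast_1", "reduce_2", "barrier_3"]
constraints = [
    ("bcast_1", "reduce_2"),
    ("reduce_2", "barrier_3"),
    ("bcast_1", "barrier_3"),
]
print(topological_order(ops, constraints))
```

In this toy graph the only valid order places `barrier_3` last, which is the sense in which a sorted dependency graph can identify a point that every participant is guaranteed to reach after its predecessors.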


