Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks

2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Pub Date: 2021-06-01. DOI: 10.1109/IPDPSW52791.2021.00089

Jonas Posner, Lukas Reitz, Claudia Fohry


Citations: 3

Abstract

With the advent of exascale computing, issues such as application irregularity and permanent hardware failure are growing in importance. Irregularity is often addressed by task-based parallel programming coupled with work stealing. At the task level, resilience can be provided by two principal approaches, namely checkpointing and supervision. For both, particular algorithms have been worked out recently. They perform local recovery and continue the program execution on a reduced set of resources. The checkpointing algorithms regularly save task descriptors explicitly, while the supervision algorithms exploit the natural duplication of task descriptors during work stealing and may be coupled with steal tracking to minimize the number of task re-executions. Thus far, the two groups of algorithms have been targeted at different task models: checkpointing algorithms at dynamic independent tasks, and supervision algorithms at nested fork-join programs. This paper transfers the most advanced supervision algorithm to the dynamic independent tasks model, thus enabling a comparison between checkpointing and supervision. Our comparison includes experiments and running time predictions. Results consistently show typical resilience overheads below 1% for both approaches. The overheads are lower for supervision in practically relevant cases, but checkpointing takes the lead once the number of processes reaches the order of millions.
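
To make the contrast between the two approaches concrete, the following is a minimal single-process sketch in Python. It is not the authors' algorithm or code: the `Worker` class, the toy `process` task, and both driver functions are hypothetical names introduced only for illustration, and failures are simulated rather than real. The checkpointing variant explicitly saves the task queue and partial result at a fixed interval and restores the last copy after a failure; the supervision variant lets the victim keep the descriptor of a stolen task and re-execute it if the thief is lost. Steal tracking is omitted here.

```python
import copy
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker: a local task queue plus a partial result."""
    queue: deque = field(default_factory=deque)
    result: int = 0


def process(worker, task):
    """Toy dynamic independent task: add its value, spawn two smaller tasks."""
    worker.result += task
    if task > 1:
        worker.queue.extend([task - 1, task - 2])


def run_with_checkpointing(root, interval, fail_after=None):
    """Save the worker state every `interval` tasks; restore it on failure."""
    w = Worker(deque([root]))
    saved = copy.deepcopy(w)              # initial checkpoint
    executed = 0                          # counts re-executions too
    while w.queue:
        if executed == fail_after:        # simulated failure (assumption)
            w = copy.deepcopy(saved)      # local recovery from last checkpoint
            fail_after = None
        process(w, w.queue.pop())
        executed += 1
        if executed % interval == 0:
            saved = copy.deepcopy(w)      # explicit save of task descriptors
    return w.result, executed


def run_with_supervision(root, thief_fails):
    """Victim keeps the descriptor of a stolen task and re-runs it if needed."""
    victim, thief = Worker(deque([root])), Worker()
    process(victim, victim.queue.pop())   # victim generates some work
    stolen = victim.queue.popleft()       # thief steals the oldest task
    supervised = stolen                   # copy that stays with the victim
    thief.queue.append(stolen)
    while thief.queue:
        process(thief, thief.queue.pop())
    if thief_fails:
        victim.queue.append(supervised)   # thief's work is lost: re-execute
    else:
        victim.result += thief.result     # normal reduction of partial results
    while victim.queue:
        process(victim, victim.queue.pop())
    return victim.result


if __name__ == "__main__":
    print(run_with_checkpointing(10, interval=8))
    print(run_with_checkpointing(10, interval=8, fail_after=21))
    print(run_with_supervision(10, thief_fails=False))
    print(run_with_supervision(10, thief_fails=True))
```

In this toy setting both variants recover the same final result. The checkpointing run repeats only the tasks executed since the last save, whereas the supervision run without steal tracking repeats the whole stolen subtree, which is the kind of re-execution cost that steal tracking is meant to reduce.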

