From tasks graphs to asynchronous distributed checkpointing with local restart

2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS). Pub Date: 2020-11-01. DOI: 10.1109/FTXS51974.2020.00009

Romain Lion, Samuel Thibault


Citations: 5

Abstract

The ever-increasing number of computation units assembled in current HPC platforms leads to a concerning increase in fault probability. Traditional checkpoint/restart strategies avoid wasting large amounts of computation time when such a fault occurs. With the increasing amount of data handled by current applications, however, these strategies suffer from unreasonable data transfer demands or from the global synchronizations they entail. Meanwhile, the current trend towards task-based programming is an opportunity to revisit the principles of checkpoint/restart strategies. We propose a checkpointing scheme that is closely tied to the execution of task graphs. We describe how it allows for completely asynchronous and distributed checkpointing, as well as localized node restart, thus opening the way to very large scalability. We also show how a synergy between the application data transfers and the checkpointing transfers can keep the additional network load reasonable, measured at less than 10% on a dense linear algebra example.
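The abstract does not spell out how application transfers and checkpoint transfers combine, so the following is only a toy model of one possible reading, not the authors' implementation. It assumes that a data block is "checkpointed" as soon as a copy exists on a node other than its producer, and that a block sent to a remote consumer task therefore needs no extra checkpoint transfer. The `Task` class, the `network_load` function, and the four-task graph are all hypothetical names and data introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    node: int          # node that executes the task (assumed static mapping)
    output_size: int   # size of the block it produces, in arbitrary units
    deps: list = field(default_factory=list)  # names of tasks whose output it reads


def network_load(tasks):
    """Return (application_volume, extra_checkpoint_volume) for a task graph.

    Assumption of this toy model: a block that is already sent to at least
    one remote node for the application's own needs has a remote copy and
    costs nothing extra to checkpoint; any other block needs one extra
    transfer to some remote node.
    """
    by_name = {t.name: t for t in tasks}

    # For each producer, collect the remote nodes that receive its output anyway.
    remote_readers = {t.name: set() for t in tasks}
    for t in tasks:
        for dep in t.deps:
            if by_name[dep].node != t.node:
                remote_readers[dep].add(t.node)

    app_volume = 0
    extra_volume = 0
    for t in tasks:
        receivers = remote_readers[t.name]
        # Each remote node receives the block once for the application.
        app_volume += t.output_size * len(receivers)
        # No remote reader: the checkpoint copy must be sent explicitly.
        if not receivers:
            extra_volume += t.output_size
    return app_volume, extra_volume


# Hypothetical four-task graph spread over two nodes.
example = [
    Task("A", node=0, output_size=10),
    Task("B", node=1, output_size=10, deps=["A"]),
    Task("C", node=0, output_size=10, deps=["A"]),
    Task("D", node=1, output_size=10, deps=["B", "C"]),
]

app, extra = network_load(example)
print(f"application volume: {app}, extra checkpoint volume: {extra}")
```

In this small graph only the blocks that never leave their node incur extra traffic. The overhead ratio here says nothing about the figure reported in the abstract, which was measured on a dense linear algebra workload with a very different communication pattern.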

