Fault Tolerance for OpenSHMEM

International Conference on Partitioned Global Address Space Programming Models
Pub Date: 2014-10-06
DOI: 10.1145/2676870.2676894

Pengfei Hao, Pavel Shamis, Manjunath Gorentla Venkata, S. Pophale, A. Welch, S. Poole, B. Chapman

Citations: 12

Abstract

On today's supercomputing systems, faults are becoming the norm rather than the exception. Given the complexity required to achieve the expected scalability and performance on future systems, this situation is expected to become worse. These systems are expected to function in a nearly constant presence of faults. To be productive on them, programming models will require both hardware and software to be resilient to faults. With the growing importance of the PGAS programming model and OpenSHMEM as part of the HPC software stack, the lack of a fault-tolerance model may become a liability for its users. To this end, in this paper we discuss the viability of using checkpoint/restart as a fault-tolerance method for OpenSHMEM, propose a selective checkpoint/restart fault-tolerance model, and discuss the challenges associated with implementing the proposed model.
