Fault Tolerance for OpenSHMEM

Pengfei Hao, Pavel Shamis, Manjunath Gorentla Venkata, S. Pophale, A. Welch, S. Poole, B. Chapman
{"title":"Fault Tolerance for OpenSHMEM","authors":"Pengfei Hao, Pavel Shamis, Manjunath Gorentla Venkata, S. Pophale, A. Welch, S. Poole, B. Chapman","doi":"10.1145/2676870.2676894","DOIUrl":null,"url":null,"abstract":"On today's supercomputing systems, faults are becoming a norm rather than an exception. Given the complexity required for achieving expected scalability and performance on future systems, this situation is expected to become worse. The systems are expected to function in a nearly constant presence of faults. To be productive on these systems, programming models will require both hardware and software to be resilient to faults. With the growing importance of PGAS programming model and OpenSHMEM, as a part of HPC software stack, a lack of a fault tolerance model may become a liability for its users. Towards this end, in this paper, we discuss the viability of using checkpoint/restart as a fault-tolerance method for OpenSHMEM, propose a selective checkpoint/restart fault-tolerance model, and discuss challenges associated with implementing the proposed model.","PeriodicalId":245693,"journal":{"name":"International Conference on Partitioned Global Address Space Programming Models","volume":"167 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Partitioned Global Address Space Programming Models","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2676870.2676894","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12

Abstract

On today's supercomputing systems, faults are becoming a norm rather than an exception. Given the complexity required for achieving expected scalability and performance on future systems, this situation is expected to become worse. The systems are expected to function in a nearly constant presence of faults. To be productive on these systems, programming models will require both hardware and software to be resilient to faults. With the growing importance of PGAS programming model and OpenSHMEM, as a part of HPC software stack, a lack of a fault tolerance model may become a liability for its users. Towards this end, in this paper, we discuss the viability of using checkpoint/restart as a fault-tolerance method for OpenSHMEM, propose a selective checkpoint/restart fault-tolerance model, and discuss challenges associated with implementing the proposed model.
OpenSHMEM的容错性
在今天的超级计算系统中,故障正在成为一种常态,而不是例外。考虑到在未来的系统上实现预期的可伸缩性和性能所需的复杂性,这种情况预计会变得更糟。预计这些系统将在几乎持续存在故障的情况下运行。为了在这些系统上高效地工作,编程模型将要求硬件和软件对故障都具有弹性。随着PGAS编程模型和OpenSHMEM作为高性能计算软件栈的一部分的重要性日益提高,缺乏容错模型可能成为其用户的负担。为此,在本文中,我们讨论了使用检查点/重启作为OpenSHMEM容错方法的可行性,提出了一个选择性检查点/重启容错模型,并讨论了与实现所提议模型相关的挑战。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信