Simplifying the Recovery Model of User-Level Failure Mitigation

Wesley Bland, Kenneth Raffenetti, P. Balaji
Venue: 2014 Workshop on Exascale MPI at Supercomputing Conference
Published: 2014-11-16
DOI: 10.1109/ExaMPI.2014.4 (https://doi.org/10.1109/ExaMPI.2014.4)
Citations: 12

Abstract

As resilience research in high-performance computing has matured, so too have the tools, libraries, and languages that result from it. The Message Passing Interface (MPI) Forum is considering the addition of fault tolerance to a future version of the MPI standard, and a new chapter called User-Level Failure Mitigation (ULFM) has been proposed to fill this need. However, as ULFM usage has become more widespread, many potential users are concerned about its complexity and the need to rewrite existing codes. In this paper, we present a usage model that is similar to the usage already common in existing codes but that does not require the user to restart the application (and thereby avoids the costs of re-entering the batch queue, application startup, etc.). We use a new implementation of ULFM in MPICH, a popular open-source MPI implementation, and demonstrate ULFM usage with the Monte Carlo Communication Kernel, a proxy app developed by the Center for Exascale Simulation of Advanced Reactors. Results show that the approach incurs a level of intrusiveness into the code similar to that of existing checkpoint/restart models, but with less overhead.
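The abstract contrasts classic checkpoint/restart, where the whole job exits and is resubmitted, with a model in which the application recovers in place. The following is a minimal toy sketch of that second control flow, not the paper's code and not MPI: it runs in a single Python process, and every name in it (`ProcessFailure`, `run_step`, `run_with_in_place_recovery`, `fail_at`, `checkpoint_every`) is hypothetical. In a real ULFM program the "drop the failed rank" step would presumably be done with the proposal's communicator operations such as `MPIX_Comm_revoke` and `MPIX_Comm_shrink`; here it is just a list filter.

```python
class ProcessFailure(Exception):
    """Stands in for an MPI error reporting that a peer process has failed."""

    def __init__(self, rank):
        super().__init__(f"rank {rank} failed")
        self.rank = rank


def run_step(ranks, step, fail_at):
    """Simulate one collective step over the surviving ranks.

    fail_at maps a step number to the rank that dies during that step
    (a hypothetical failure schedule used only for this illustration).
    """
    failed = fail_at.get(step)
    if failed in ranks:
        raise ProcessFailure(failed)
    # Toy "work": every surviving rank contributes one unit per step.
    return len(ranks)


def run_with_in_place_recovery(n_ranks, n_steps, fail_at, checkpoint_every=2):
    """Run n_steps, recovering from failures without leaving this function.

    On a failure the survivors (1) drop the failed rank, (2) roll back to the
    last checkpoint, and (3) keep going. Nothing is resubmitted or restarted.
    """
    ranks = list(range(n_ranks))
    checkpoint = (0, 0)  # (next step to run, work accumulated so far)
    step, work = checkpoint
    recoveries = 0

    while step < n_steps:
        try:
            work += run_step(ranks, step, fail_at)
        except ProcessFailure as err:
            # Assumption: shrinking to the survivors is an acceptable recovery.
            ranks = [r for r in ranks if r != err.rank]
            step, work = checkpoint  # roll back, as in checkpoint/restart
            recoveries += 1
            continue
        step += 1
        if step % checkpoint_every == 0:
            checkpoint = (step, work)

    return work, ranks, recoveries


if __name__ == "__main__":
    # Rank 2 fails during step 3 of a 6-step run on 4 ranks.
    print(run_with_in_place_recovery(4, 6, {3: 2}))
```

The structure mirrors a checkpoint/restart code (periodic checkpoints, rollback on failure), which is the sense in which the intrusiveness is comparable; the difference is that the rollback happens inside the running loop rather than after a fresh job launch.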