Implementing efficient message logging protocols as MPI application extensions

K. Dichev, Dimitrios S. Nikolopoulos
Proceedings of the 26th European MPI Users' Group Meeting. Published 2019-05-08. DOI: 10.1145/3343211.3343219 (https://doi.org/10.1145/3343211.3343219)
Citations: 2

Abstract

Message logging protocols are enablers of local rollback, a more efficient alternative to global rollback, for fault-tolerant MPI applications. Until now, message logging MPI implementations have incurred the overheads of a redesign and redeployment of an MPI library, as well as continued performance penalties across various kernels. Successful research efforts for message logging implementations do exist, but not a single one of them can be easily deployed today by more than a few experts. In contrast, in this work we build efficient message logging capabilities on top of an MPI library with no message logging capabilities; we do so for two different send-deterministic HPC kernels, one with a global exchange pattern (CG), and one with a neighbour exchange pattern (LULESH). While our library of choice, ULFM, detects failure and recovers MPI communicators, we build on that to then restore the intra- and inter-process data consistency of both applications. This task follows a similar pattern across these kernels, and we present our methodology in a generic way. In the end, our extensions provide message logging capabilities for each kernel, without the need for an actual message logging runtime underneath. On the performance side, we eliminate event logging for these kernels, and design a flexible user-defined hybrid between global and local rollback. Our extensions span a few hundred lines of code for each kernel, are open-sourced, and enable local and global rollback after process failure.
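To make the idea of local rollback concrete, the following is a minimal toy sketch of sender-based message logging in plain Python. It is not the authors' implementation and uses no MPI or ULFM calls: the `Process` class, `payload_for`, `run_iteration`, `local_rollback`, and `demo` are all hypothetical names introduced for illustration. It assumes a ring exchange, coordinated in-memory checkpoints, and payloads that depend only on rank and iteration, which is a simplified stand-in for send-determinism. Under those assumptions only the failed process rolls back, survivors replay their logged payloads to it, and no event (ordering) log is needed.

```python
from collections import defaultdict


def payload_for(rank, iteration):
    # Assumption: payloads depend only on (rank, iteration), so re-execution
    # reproduces exactly the same sends.
    return (rank + 1) * (iteration + 1)


class Process:
    """Toy process whose state is the sum of every payload it has received."""

    def __init__(self, rank):
        self.rank = rank
        self.state = 0
        self.iteration = 0
        self.checkpoint = (0, 0)           # (iteration, state)
        self.send_log = defaultdict(list)  # dest rank -> [(iteration, payload)]

    def take_checkpoint(self):
        self.checkpoint = (self.iteration, self.state)

    def send(self, dest, payload):
        # Sender-based logging: keep a copy of each outgoing payload.
        self.send_log[dest.rank].append((self.iteration, payload))
        dest.receive(payload)

    def receive(self, payload):
        self.state += payload

    def restore(self):
        self.iteration, self.state = self.checkpoint


def run_iteration(procs):
    # Neighbour exchange on a ring: each rank sends to its right neighbour.
    n = len(procs)
    for p in procs:
        p.send(procs[(p.rank + 1) % n], payload_for(p.rank, p.iteration))
    for p in procs:
        p.iteration += 1


def local_rollback(procs, failed_rank):
    """Recover one process without rolling back the survivors."""
    failed = procs[failed_rank]
    survivors = [p for p in procs if p.rank != failed_rank]
    target = survivors[0].iteration  # iteration the survivors have reached
    failed.restore()
    failed.send_log.clear()          # the volatile log died with the process
    start = failed.iteration
    # Survivors replay logged payloads sent since the failed rank's checkpoint.
    for sender in survivors:
        for it, payload in sender.send_log[failed_rank]:
            if it >= start:
                failed.receive(payload)
    # Rebuild the failed rank's own log; its re-sent messages are suppressed
    # because the survivors already received them.
    dest = (failed_rank + 1) % len(procs)
    for it in range(start, target):
        failed.send_log[dest].append((it, payload_for(failed_rank, it)))
    failed.iteration = target


def demo():
    procs = [Process(r) for r in range(3)]
    for _ in range(2):
        run_iteration(procs)
    for p in procs:
        p.take_checkpoint()
    for _ in range(3):
        run_iteration(procs)
    before = [p.state for p in procs]
    procs[1].state = -999            # simulate losing rank 1's memory
    local_rollback(procs, failed_rank=1)
    after = [p.state for p in procs]
    return before, after


if __name__ == "__main__":
    print(demo())
```

Because the toy state is a commutative sum, replay order does not matter here; a real kernel would have to replay messages in per-channel order, and would rely on a fault-tolerant MPI layer to detect the failure and repair the communicator before any replay begins.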