Implementing efficient message logging protocols as MPI application extensions
Authors: K. Dichev, Dimitrios S. Nikolopoulos
Venue: Proceedings of the 26th European MPI Users' Group Meeting
DOI: 10.1145/3343211.3343219 (https://doi.org/10.1145/3343211.3343219)
Publication date: 2019-05-08
Citations: 2
Abstract
Message logging protocols enable local rollback, a more efficient alternative to global rollback, for fault-tolerant MPI applications. Until now, message logging MPI implementations have incurred the overhead of redesigning and redeploying an MPI library, as well as continued performance penalties across various kernels. Successful research implementations of message logging do exist, but none of them can be easily deployed today by more than a few experts. In contrast, in this work we build efficient message logging capabilities on top of an MPI library that has none; we do so for two different send-deterministic HPC kernels, one with a global exchange pattern (CG) and one with a neighbour exchange pattern (LULESH). Our library of choice, ULFM, detects failures and recovers MPI communicators; we build on that to restore the intra- and inter-process data consistency of both applications. This task follows a similar pattern across these kernels, and we present our methodology in a generic way. In the end, our extensions provide message logging capabilities for each kernel without the need for an actual message logging runtime underneath. On the performance side, we eliminate event logging for these kernels and design a flexible, user-defined hybrid between global and local rollback. Our extensions span a few hundred lines of code for each kernel, are open-sourced, and enable local and global rollback after process failure.
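To make the idea of local rollback with application-level message logging concrete, the following is a minimal sketch. It is not the authors' implementation: it simulates a ring-style neighbour exchange in plain Python, without MPI or ULFM. All names (`Proc`, `run`, `recover`, `step_value`) are hypothetical. It assumes that the kernel is send-deterministic, that each sender keeps a log of its outgoing payloads, and that checkpoints survive the failure.

```python
from dataclasses import dataclass, field


@dataclass
class Proc:
    """One simulated process (hypothetical stand-in for an MPI rank)."""
    rank: int
    value: int
    # Sender-based message log: (destination rank, step) -> payload sent.
    send_log: dict = field(default_factory=dict)
    # Last checkpoint as (step, value); assumed to survive a failure.
    checkpoint: tuple = (0, 0)


def step_value(own, left, right):
    """Deterministic update from own state and the two received messages."""
    return (own + left + right) % 1000003


def neighbours(rank, n):
    return (rank - 1) % n, (rank + 1) % n


def recover(procs, rank, resume_step):
    """Local rollback: only the failed rank re-executes from its checkpoint.

    Messages it needs are replayed from the survivors' sender logs, so the
    survivors do not roll back. Because the kernel is send-deterministic,
    the messages it regenerates equal the ones already delivered, so they
    are re-logged but not re-delivered.
    """
    n = len(procs)
    ckpt_step, ckpt_value = procs[rank].checkpoint
    restarted = Proc(rank, ckpt_value, checkpoint=(ckpt_step, ckpt_value))
    left, right = neighbours(rank, n)
    for step in range(ckpt_step, resume_step):
        for dest in (left, right):
            restarted.send_log[(dest, step)] = restarted.value
        from_left = procs[left].send_log[(rank, step)]
        from_right = procs[right].send_log[(rank, step)]
        restarted.value = step_value(restarted.value, from_left, from_right)
    procs[rank] = restarted


def run(n_procs, n_steps, ckpt_every, fail_rank=None, fail_step=None):
    """Run the exchange; optionally lose one rank's state after fail_step."""
    procs = [Proc(r, r + 1) for r in range(n_procs)]
    for step in range(n_steps):
        if step % ckpt_every == 0:
            for p in procs:
                p.checkpoint = (step, p.value)
        # Send phase: every rank logs the payload it sends to each neighbour.
        for p in procs:
            for dest in neighbours(p.rank, n_procs):
                p.send_log[(dest, step)] = p.value
        # Receive phase: every rank consumes the messages addressed to it.
        new_values = []
        for p in procs:
            left, right = neighbours(p.rank, n_procs)
            new_values.append(step_value(
                p.value,
                procs[left].send_log[(p.rank, step)],
                procs[right].send_log[(p.rank, step)],
            ))
        for p, v in zip(procs, new_values):
            p.value = v
        if fail_rank is not None and step == fail_step:
            recover(procs, fail_rank, step + 1)
    return [p.value for p in procs]
```

A run with an injected failure should end in the same state as a failure-free run, for example `run(4, 10, 3) == run(4, 10, 3, fail_rank=2, fail_step=7)`. The sketch omits log garbage collection, real failure detection, and communicator repair, which in the setting described above would be handled by the application extensions together with ULFM.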