An Application Level Approach for Proactive Process Migration in MPI Applications

Iván Cores, Gabriel Rodríguez, P. González, María J. Martín
Published in: 2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies
Publication date: 2011-10-20
DOI: 10.1109/PDCAT.2011.16 (https://doi.org/10.1109/PDCAT.2011.16)
Citations: 2

Abstract

The running times of large-scale computational science and engineering parallel applications are usually longer than the mean time between failures (MTBF). Parallel applications must therefore tolerate hardware failures, so that a machine failure does not discard all the computation completed so far. Checkpointing with rollback recovery is a very useful technique for implementing fault-tolerant applications. However, when a failure occurs, most checkpointing mechanisms require a complete restart of the parallel application from the last checkpoint. This reduces the efficiency of the solution and introduces overhead that can be avoided by migrating only the affected process. Although research has been carried out in this field, the solutions proposed in the literature are commonly tied to specific implementations of the parallel communication APIs or to specific runtime environments. The approach presented in this work extends an application-level checkpointing framework to proactively migrate MPI processes away from processors when an impending failure is reported, without having to restart the entire application. The main features of the proposed solution are transparency for the user, achieved through a compiler tool and a runtime library, and portability, since it is not locked into a particular MPI implementation.
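
The contrast between a full restart and a single-process migration can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the framework described in the paper: it makes no MPI calls, and the names `Process`, `checkpoint`, `restore`, and `migrate` are hypothetical. It shows the general idea that, when a node reports an impending failure, only the processes on that node serialize their application-level state and resume on a spare node, while every other process keeps running untouched.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical stand-in for one rank of a parallel application."""
    rank: int
    node: str
    state: dict = field(default_factory=dict)

    def checkpoint(self) -> bytes:
        # Application-level checkpoint: serialize only the variables
        # the application needs in order to resume.
        return pickle.dumps(self.state)

    def restore(self, blob: bytes) -> None:
        # Rebuild the application state from a checkpoint.
        self.state = pickle.loads(blob)


def migrate(procs: list, failing_node: str, spare_node: str) -> list:
    """Move every process on failing_node to spare_node.

    Returns the ranks that were migrated. Processes on healthy nodes
    are left in place, so no global restart is needed.
    """
    moved = []
    for i, proc in enumerate(procs):
        if proc.node == failing_node:
            blob = proc.checkpoint()
            replacement = Process(rank=proc.rank, node=spare_node)
            replacement.restore(blob)
            procs[i] = replacement
            moved.append(proc.rank)
    return moved


# Four ranks, one per node, all at iteration 10 (illustrative values).
procs = [
    Process(rank=r, node=f"node{r}", state={"iteration": 10, "partial_sum": r * 1.5})
    for r in range(4)
]

# Assumption: some monitoring component reports that node2 is about to fail.
moved = migrate(procs, failing_node="node2", spare_node="spare0")
print("migrated ranks:", moved)
print("rank 2 now on:", procs[2].node, "with state", procs[2].state)
```

In this sketch the cost of handling the failure is proportional to the state of the migrated process alone. A real MPI application would also have to re-establish communication with the relocated rank, which the simulation deliberately leaves out.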