An Application Level Approach for Proactive Process Migration in MPI Applications

2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies. Pub Date: 2011-10-20. DOI: 10.1109/PDCAT.2011.16

Iván Cores, Gabriel Rodríguez, P. González, María J. Martín


Citations: 2

Abstract

The running times of large-scale computational science and engineering parallel applications are usually longer than the mean time between failures (MTBF). Parallel applications must therefore tolerate hardware failures, so that a machine failure does not discard all the computation completed so far. Checkpointing with rollback recovery is a useful technique for implementing fault-tolerant applications. However, when a failure occurs, most checkpointing mechanisms require a complete restart of the parallel application from the last checkpoint. This reduces the efficiency of the solution and introduces overhead that can be avoided by migrating only the affected process. Although research has been carried out in this field, the solutions proposed in the literature are commonly tied to specific implementations of the parallel communication APIs or to specific runtime environments. The approach presented in this work extends an application-level checkpointing framework to proactively migrate MPI processes away from processors when an impending failure is reported, without having to restart the entire application. The main features of the proposed solution are transparency for the user, achieved through a compiler tool and a runtime library, and portability, since it is not locked into a particular MPI implementation.
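The idea of migrating a single process rather than restarting the whole job can be illustrated with a toy simulation. The sketch below is not the authors' framework and does not use MPI: the `Rank` class, the `run` function, the host names, and the JSON checkpoint format are all hypothetical, chosen only to show one process saving its application-level state when a failure warning arrives and resuming on a spare host while the other processes keep their state.

```python
import json
import os
import tempfile


class Rank:
    """Toy stand-in for one MPI process (hypothetical, no real MPI involved)."""

    def __init__(self, rank_id, host):
        self.rank_id = rank_id
        self.host = host
        self.iteration = 0
        self.partial_sum = 0

    def step(self):
        # One unit of application work.
        self.iteration += 1
        self.partial_sum += self.rank_id * self.iteration

    def checkpoint(self, directory):
        # Application-level checkpoint: save only the variables needed to resume.
        path = os.path.join(directory, f"rank{self.rank_id}.json")
        with open(path, "w") as f:
            json.dump({"iteration": self.iteration,
                       "partial_sum": self.partial_sum}, f)
        return path

    @classmethod
    def restore(cls, rank_id, new_host, path):
        # Recreate the process on a different host from its checkpoint file.
        with open(path) as f:
            state = json.load(f)
        rank = cls(rank_id, new_host)
        rank.iteration = state["iteration"]
        rank.partial_sum = state["partial_sum"]
        return rank


def run(num_ranks, steps, failing_rank, warn_at_step, spare_host):
    """Run all ranks; at warn_at_step, migrate only failing_rank to spare_host."""
    ranks = [Rank(i, f"node{i}") for i in range(num_ranks)]
    with tempfile.TemporaryDirectory() as ckpt_dir:
        for s in range(steps):
            if s == warn_at_step:
                # Impending failure reported: move the affected rank only.
                path = ranks[failing_rank].checkpoint(ckpt_dir)
                ranks[failing_rank] = Rank.restore(failing_rank, spare_host, path)
            for r in ranks:
                r.step()
    return ranks
```

Calling `run(4, 10, failing_rank=2, warn_at_step=5, spare_host="spare0")` leaves rank 2 on `spare0` with the same iteration count and partial result it would have had without the migration, and the other ranks never roll back. A real implementation would also have to reconnect the migrated process to its communicators and handle in-flight messages, which this sketch omits.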


