Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation

Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06). Published: 2006-05-16. DOI: 10.1109/CCGRID.2006.81

Yuan Tang, G. Fagg, J. Dongarra

Citations: 6

Abstract

As the number of processors in modern HPC (High Performance Computing) systems grows, two problems become pressing: scalability and fault tolerance. In our previous work, we extended the MPI specification for fault tolerance by specifying a systematic framework of recovery methods, communicator and message modes, etc. that defines the behavior of MPI when an error occurs. These extensions not only specify how the MPI library implementation and the RTE (Run Time Environment) handle failures at the system level, but also give HPC application developers a range of recovery choices with varying performance and cost. In this paper, we continue to extend MPI's capabilities in this direction. First, we propose an MPI operation-level checkpoint/rollback library to recover the user's data. More importantly, we argue that the programming model for future fault-tolerant MPI applications should be recover-and-continue rather than the more traditional stop-and-restart. Recover-and-continue means that when an error occurs, only the failed processes are re-spawned; all surviving processes keep their original processor mapping and stay in memory. The main benefits of recover-and-continue are a much lower system recovery cost and the opportunity to use in-memory checkpoint/rollback techniques. Compared with stable-storage or local-disk techniques, which are the only options under stop-and-restart, the in-memory approach significantly reduces the performance penalty of checkpoint/rollback. It also makes it possible to establish a concurrent multi-level checkpoint/rollback framework. As our work progresses, a picture of the hierarchy of future fault-tolerant HPC systems will gradually emerge.
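
The recover-and-continue idea can be illustrated with a small simulation. The sketch below is not the authors' library and does not use MPI: it models ranks as Python objects in a single process, and every name in it (`Rank`, `checkpoint`, `recover_and_continue`, `compute_step`) is hypothetical. It also assumes a simple "buddy" scheme, in which each rank keeps an in-memory copy of its neighbour's checkpoint, as one possible way to make a failed rank's data recoverable without disk; the abstract does not say which redundancy scheme the paper uses.

```python
import copy


class Rank:
    """One simulated process; `state` stands in for the user's data."""

    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.state = {"step": 0, "value": 0.0}
        self.own_ckpt = None    # in-memory copy of this rank's own state
        self.buddy_ckpt = None  # in-memory copy of the previous rank's state


def checkpoint(ranks):
    """Save every rank's state in memory, locally and on the next rank."""
    n = len(ranks)
    for r in ranks:
        r.own_ckpt = copy.deepcopy(r.state)
        ranks[(r.rank_id + 1) % n].buddy_ckpt = copy.deepcopy(r.state)


def recover_and_continue(ranks, failed_id):
    """Re-spawn only the failed rank; survivors roll back in place."""
    n = len(ranks)
    buddy = ranks[(failed_id + 1) % n]   # holds the failed rank's checkpoint
    pred = ranks[(failed_id - 1) % n]    # its checkpoint was lost with the failed rank

    fresh = Rank(failed_id)              # stands in for a re-spawned process
    fresh.state = copy.deepcopy(buddy.buddy_ckpt)
    fresh.own_ckpt = copy.deepcopy(buddy.buddy_ckpt)
    fresh.buddy_ckpt = copy.deepcopy(pred.own_ckpt)
    ranks[failed_id] = fresh

    # Surviving ranks are not restarted; they restore their in-memory copy.
    for r in ranks:
        if r.rank_id != failed_id:
            r.state = copy.deepcopy(r.own_ckpt)


def compute_step(ranks):
    """Placeholder for one iteration of the application's work."""
    for r in ranks:
        r.state["step"] += 1
        r.state["value"] += r.rank_id + 1


ranks = [Rank(i) for i in range(4)]
compute_step(ranks)
compute_step(ranks)
checkpoint(ranks)        # consistent checkpoint taken at step 2
compute_step(ranks)      # step 3 is lost when rank 2 fails
survivor = ranks[0]
recover_and_continue(ranks, failed_id=2)
compute_step(ranks)      # all ranks continue from step 2
```

In this sketch only rank 2 is replaced by a new object, while the other `Rank` objects persist across the failure, mirroring the contrast with stop-and-restart, where every process would be relaunched and reloaded from disk.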
