RDMA-Based Job Migration Framework for MPI over InfiniBand

2010 IEEE International Conference on Cluster Computing | Pub Date: 2010-09-20 | DOI: 10.1109/CLUSTER.2010.20

Xiangyong Ouyang, Sonya Marcarelli, R. Rajachandrasekar, D. Panda


Citations: 30

Abstract

Coordinated checkpoint and recovery is a common approach to achieving fault tolerance on large-scale systems. The traditional mechanism dumps the process images of all the processes involved in the parallel job to a local disk or a central storage area. When a failure occurs, the processes are restarted and restored from the latest checkpoint image. However, this approach cannot provide the scalability required by increasingly large jobs, since it puts a heavy I/O burden on the storage subsystem, and resubmitting a job during the restart phase incurs a lengthy queuing delay. In this paper, we enhance the fault tolerance of MVAPICH2, an open-source high-performance MPI-2 implementation, with a proactive job migration scheme. Instead of checkpointing all the processes of the job and saving their process images to stable storage, we transfer the processes running on a health-deteriorating node to a healthy spare node and resume them there. RDMA-based process image transmission is designed to take advantage of the high-performance communication in InfiniBand. Experimental results show that the Job Migration scheme can achieve a speedup of 4.49 times over the Checkpoint/Restart scheme in handling a node failure for a 64-process application running on 8 compute nodes. To the best of our knowledge, this is the first such job migration design for InfiniBand-based clusters.
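As a rough illustration of why migration touches less state than a full checkpoint, the toy model below counts how many process images each scheme would have to move for the 64-process, 8-node configuration mentioned in the abstract. It is only a sketch under stated assumptions: the even block placement of ranks, the node names, and the functions `make_placement`, `checkpoint_all`, and `migrate` are hypothetical and are not part of MVAPICH2 or the authors' framework. It models placement bookkeeping only, not RDMA transfer or timing.

```python
# Toy model (hypothetical names, not MVAPICH2 APIs): compare how many
# process images a full checkpoint and a single-node migration must handle.

def make_placement(num_procs, nodes):
    """Assign ranks to nodes in equal contiguous blocks (an assumption)."""
    if num_procs % len(nodes) != 0:
        raise ValueError("num_procs must be a multiple of the node count")
    per_node = num_procs // len(nodes)
    return {rank: nodes[rank // per_node] for rank in range(num_procs)}

def checkpoint_all(placement):
    """Coordinated checkpoint: every rank's image is written to storage."""
    return sorted(placement)

def migrate(placement, failing_node, spare_node):
    """Proactive migration: only ranks on the failing node are moved."""
    moved = sorted(r for r, n in placement.items() if n == failing_node)
    new_placement = {
        r: (spare_node if n == failing_node else n)
        for r, n in placement.items()
    }
    return moved, new_placement

nodes = [f"node{i}" for i in range(8)]      # 8 compute nodes
placement = make_placement(64, nodes)       # 64 processes
moved, new_placement = migrate(placement, "node3", "spare0")

print("images written by a full checkpoint:", len(checkpoint_all(placement)))
print("images moved by migration:", len(moved))
```

Under these assumptions the full checkpoint handles all 64 images, while migration moves only the 8 images that lived on the deteriorating node. The measured 4.49x speedup reported in the abstract depends on the real system and is not derived from this count.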

