Mini-Ckpts: Surviving OS Failures in Persistent Memory

Proceedings of the 2016 International Conference on Supercomputing; Pub Date: 2016-06-01; DOI: 10.1145/2925426.2926295

David Fiala, F. Mueller, Kurt B. Ferreira, C. Engelmann


Citations: 5

Abstract

Concern is growing in the high-performance computing (HPC) community about the reliability of future extreme-scale systems. Current efforts have focused on application fault tolerance rather than the operating system (OS), even though recent studies suggest that failures in OS memory may be more likely. The OS is critical to the correct and efficient operation of the node and the processes it governs, and the parallel nature of HPC applications means that any single node failure generally forces all processes of the application to terminate because of their tight communication. Therefore, in a robust system, the OS itself must be capable of tolerating failures. In this work, we introduce mini-ckpts, a framework that enables an application to survive a fatal OS failure or crash. mini-ckpts achieves this tolerance by ensuring that the critical data describing a process is preserved in persistent memory prior to the failure. Following the failure, the OS is rejuvenated via a warm reboot and the application continues execution, effectively making the failure and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three and six seconds and has a failure-free overhead of between 3% and 5% for a number of key HPC workloads. In contrast to current fault-tolerance methods, this work ensures that the operating and runtime systems can continue in the presence of faults. This is a much finer-grained and more dynamic method of fault tolerance than the current coarse-grained, application-centric methods. Handling faults at this level has the potential to greatly reduce overheads and enables mitigation of additional faults.
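The recovery flow described in the abstract (preserve the critical per-process data before the failure, warm-reboot the OS, resume the application) can be illustrated with a toy simulation. This is only a conceptual sketch and not the authors' implementation, which operates inside the OS: a plain dictionary stands in for persistent memory, a small Python object stands in for the OS, and every name here (`ToyOS`, `mini_checkpoint`, `warm_reboot`, `run`) is hypothetical.

```python
# Toy illustration only. A dict stands in for persistent memory and a Python
# object stands in for the OS. None of these names come from mini-ckpts.


class ToyOS:
    """Volatile OS bookkeeping that is lost when the OS crashes."""

    def __init__(self):
        self.process_table = {}


def mini_checkpoint(persistent, pid, state):
    """Copy the critical data describing a process into the persistent region."""
    persistent[pid] = dict(state)  # copy, so later updates do not alias it


def warm_reboot(persistent):
    """Start a fresh OS and re-adopt every process found in the persistent region."""
    fresh = ToyOS()
    for pid, state in persistent.items():
        fresh.process_table[pid] = dict(state)
    return fresh


def run(steps, crash_at):
    """Sum 0..steps-1 in one 'process', injecting one OS crash at step crash_at."""
    persistent = {}
    os_ = ToyOS()
    os_.process_table[1] = {"step": 0, "total": 0}
    recoveries = 0
    while os_.process_table[1]["step"] < steps:
        proc = os_.process_table[1]
        mini_checkpoint(persistent, 1, proc)  # state is saved before any failure
        if proc["step"] == crash_at and recoveries == 0:
            os_ = warm_reboot(persistent)  # the old OS object is discarded
            recoveries += 1
            continue
        proc["total"] += proc["step"]
        proc["step"] += 1
    return os_.process_table[1]["total"], recoveries


if __name__ == "__main__":
    print(run(5, 3))
```

In this sketch the application produces the same result whether or not the simulated OS crash occurs, which is the sense in which the abstract calls the failure and restart transparent. The sketch says nothing about the real costs (the measured three to six seconds of recovery and 3% to 5% failure-free overhead).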

