Application-level checkpointing for shared memory programs

ASPLOS XI · Pub Date: 2004-10-07 · DOI: 10.1145/1024393.1024421

G. Bronevetsky, Daniel Marques, K. Pingali, P. Szwed, M. Schulz


Citations: 131

Abstract

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR.

Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs.

In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks.

One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
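To make the idea of coordinating CPR among threads concrete, here is a minimal sketch in Python. It is not the protocol from the paper, which the abstract does not spell out. It assumes the simplest possible scheme: every thread stops at a barrier, one thread saves the shared state, and a second barrier holds the others until the save is done. The names `run`, `CHECKPOINT_EVERY`, and `fail_at` are hypothetical, and the checkpoint is kept in memory rather than written to disk.

```python
import pickle
import threading

NUM_THREADS = 4
TOTAL_STEPS = 10
CHECKPOINT_EVERY = 5  # hypothetical checkpoint interval, in steps


def run(start_state=None, fail_at=None):
    """Run the workers and return (totals, last_checkpoint_bytes).

    start_state: a previously saved state to restart from, or None.
    fail_at: step at which every thread stops, simulating a fault.
    """
    barrier = threading.Barrier(NUM_THREADS)
    state = start_state or {"step": 0, "totals": [0] * NUM_THREADS}
    totals = list(state["totals"])
    saved = {"blob": None}

    def worker(tid):
        for step in range(state["step"], TOTAL_STEPS):
            if fail_at is not None and step == fail_at:
                return  # simulated fault: work since the last checkpoint is lost
            totals[tid] += (tid + 1) * step  # stand-in for real computation
            if (step + 1) % CHECKPOINT_EVERY == 0:
                barrier.wait()  # every thread has finished this step
                if tid == 0:
                    # One thread saves the shared state (assumption: in memory).
                    saved["blob"] = pickle.dumps(
                        {"step": step + 1, "totals": list(totals)}
                    )
                barrier.wait()  # nobody proceeds until the save is complete

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals, saved["blob"]
```

A restart then amounts to loading the last saved state and calling `run` again, for example `run(start_state=pickle.loads(blob))` after a run that was cut short with `fail_at=7`. The two barriers make this a blocking scheme: it is easy to reason about, but all threads idle while the state is saved, which is one source of the overhead that a real implementation would try to reduce.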


