{"title":"Compiler-Assisted Application-Level Checkpointing for MPI Programs","authors":"Xuejun Yang, Panfeng Wang, Hongyi Fu, Yunfei Du, Zhiyuan Wang, Jia Jia","doi":"10.1109/ICDCS.2008.25","DOIUrl":null,"url":null,"abstract":"Application-level checkpointing can decrease the overhead of fault tolerance by minimizing the amount of checkpoint data. However this technique requires the programmer to manually choose the critical data that should be saved. In this paper, we firstly propose a live-variable analysis method for MPI programs. Then, we provide an optimization method of data saving for application-level checkpointing based on the analysis method. Based on the theoretical foundation, we implement a source-to-source precompiler (ALEC) to automate application-level checkpointing. Finally, we evaluate the performance of five FORTRAN/MPI programs which are transformed and integrated checkpointing features by ALEC on a 512-CPU cluster system. The experimental results show that i) the application-level checkpointing based on live-variable analysis for MPI programs can efficiently reduce the amount of checkpoint data, thereby decrease the overhead of checkpoint and restart; ii) ALEC is capable of automating application-level checkpointing correctly and effectively.","PeriodicalId":240205,"journal":{"name":"2008 The 28th International Conference on Distributed Computing Systems","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 The 28th International Conference on Distributed Computing Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCS.2008.25","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8
Abstract
Application-level checkpointing can decrease the overhead of fault tolerance by minimizing the amount of checkpoint data. However this technique requires the programmer to manually choose the critical data that should be saved. In this paper, we firstly propose a live-variable analysis method for MPI programs. Then, we provide an optimization method of data saving for application-level checkpointing based on the analysis method. Based on the theoretical foundation, we implement a source-to-source precompiler (ALEC) to automate application-level checkpointing. Finally, we evaluate the performance of five FORTRAN/MPI programs which are transformed and integrated checkpointing features by ALEC on a 512-CPU cluster system. The experimental results show that i) the application-level checkpointing based on live-variable analysis for MPI programs can efficiently reduce the amount of checkpoint data, thereby decrease the overhead of checkpoint and restart; ii) ALEC is capable of automating application-level checkpointing correctly and effectively.