Compiler support for automatic checkpointing
Sung-Eun Choi, Steven J. Deitz
Proceedings 16th Annual International Symposium on High Performance Computing Systems and Applications, 2002-06-16
DOI: 10.1109/HPCSA.2002.1019157 (https://doi.org/10.1109/HPCSA.2002.1019157)
Citation count: 19 (Semanticscholar)

Checkpointing is a key technology for applications on large cluster computer systems. As cluster sizes grow, component failures will become a normal part of operation, and applications will have to deal more directly with repeated failures during program runs. We describe automatic checkpointing in the ZPL compiler and its advantages over traditional library or system-based approaches that have no information about application behavior. We show that even naive compiler-inserted checkpoints can significantly reduce the size of the checkpoint recovery data, up to 73% in our application suite. We also introduce the notion of checkpoint ranges, a range of code where processors can perform a local checkpoint at any time during the range. The compiler guarantees that these local checkpoints form a globally consistent checkpoint without global coordination by ensuring that there are no in-flight messages during the checkpoint range. Checkpoint ranges help further alleviate any additional network congestion caused by checkpointing.
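
The checkpoint-range idea in the abstract can be illustrated with a small toy model. The sketch below is not the ZPL compiler's implementation; the event traces, the `ranges` table, and the helper names (`messages_before`, `is_consistent`, `all_cuts`) are all hypothetical and chosen only for illustration. It assumes each process runs a fixed sequence of send, receive, and local-work events, and that a checkpoint range is a set of positions between which the process performs no communication. Under those assumptions, any combination of local checkpoints taken inside the ranges yields a cut with no in-flight messages, so no global coordination is needed.

```python
from itertools import product

# Toy event traces (hypothetical). ("send", m) and ("recv", m) are
# communication events for message id m; ("work",) is local computation.
traces = {
    0: [("send", "a"), ("work",), ("work",), ("send", "b")],
    1: [("work",), ("recv", "a"), ("work",), ("work",), ("work",), ("recv", "b")],
}

# Hypothetical checkpoint ranges: for each process, the cut positions
# (number of events already executed) at which it may checkpoint.
# Both ranges lie after message "a" is delivered and before "b" is sent.
ranges = {0: range(1, 4), 1: range(2, 6)}


def messages_before(trace, cut, kind):
    """Ids of messages with an event of the given kind before the cut."""
    return {event[1] for event in trace[:cut] if event[0] == kind}


def is_consistent(traces, cut):
    """True if the cut has no in-flight and no orphan messages.

    A message is in flight if it was sent before the sender's checkpoint
    but not yet received before the receiver's checkpoint; it is an
    orphan if it was received but not yet sent. Both cases show up as a
    mismatch between the two sets.
    """
    sent = set()
    received = set()
    for proc, position in cut.items():
        sent |= messages_before(traces[proc], position, "send")
        received |= messages_before(traces[proc], position, "recv")
    return sent == received


def all_cuts(ranges):
    """Yield every combination of local checkpoint positions."""
    procs = sorted(ranges)
    for combo in product(*(ranges[p] for p in procs)):
        yield dict(zip(procs, combo))


# Every independent choice inside the ranges forms a consistent cut.
cuts_in_range = list(all_cuts(ranges))
all_ok = all(is_consistent(traces, cut) for cut in cuts_in_range)

# A checkpoint outside the range can capture an in-flight message:
# process 0 has already sent "b", process 1 has not yet received it.
bad_cut = {0: 4, 1: 5}
bad_ok = is_consistent(traces, bad_cut)

print(len(cuts_in_range), all_ok, bad_ok)
```

In this model the processes never exchange any information about when they checkpoint; consistency follows purely from where the ranges are placed, which is the property the abstract attributes to the compiler's analysis.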