Jiannong Cao, Yifeng Chen, Kang Zhang, Yanxiang He
Checkpointing in hybrid distributed systems
7th International Symposium on Parallel Architectures, Algorithms and Networks, 2004. Proceedings. Published 2004-05-10. DOI: 10.1109/ISPAN.2004.1300471 (https://doi.org/10.1109/ISPAN.2004.1300471)
Checkpointing and rollback recovery are widely used to provide fault tolerance to computer systems that suffer from transient faults. Of the techniques proposed, the two primary checkpointing schemes are independent and coordinated checkpointing. Most existing work, however, considers applying only a single checkpointing and rollback recovery scheme to a target system. This paper discusses the issues involved in integrating independent and coordinated checkpointing schemes for applications that run in a hybrid distributed environment containing multiple heterogeneous subsystems, and develops a new algorithm for doing so. It presents the changes required to the original checkpointing scheme of each subsystem and the unnecessary rollbacks that are prevented in the integrated system as a whole. It also describes an algorithm for collecting garbage checkpoints in the combined hybrid system.
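
The abstract does not give the details of the integrated algorithm, so the following is not the authors' method. It is a minimal textbook-style sketch of why independent checkpointing can trigger cascading rollbacks, which is the cost that a coordinated scheme avoids. The function name `recovery_line`, the data layout, and the use of per-process local times are all assumptions made for illustration. A checkpoint taken at local time t is assumed to capture every event with local time at most t, and every process is assumed to have an initial checkpoint at time 0.

```python
from bisect import bisect_left


def recovery_line(checkpoints, messages):
    """Return the latest consistent set of checkpoints (one per process).

    checkpoints: dict mapping process name -> sorted list of local
                 checkpoint times, starting with 0 (hypothetical layout).
    messages:    list of (sender, sent_at, receiver, received_at) tuples,
                 with all event times strictly positive local times.
    """
    # Start optimistically from the most recent checkpoint of each process.
    line = {proc: times[-1] for proc, times in checkpoints.items()}

    changed = True
    while changed:
        changed = False
        for sender, sent_at, receiver, received_at in messages:
            # An orphan message is recorded as received in the receiver's
            # checkpoint but not yet sent in the sender's checkpoint.
            if sent_at > line[sender] and received_at <= line[receiver]:
                times = checkpoints[receiver]
                # Roll the receiver back to its last checkpoint taken
                # strictly before the receive event.
                line[receiver] = times[bisect_left(times, received_at) - 1]
                changed = True
    return line


# Uncoordinated checkpoints: each rollback orphans another message,
# so both processes fall back to their initial state.
independent = recovery_line(
    {"P": [0, 2], "Q": [0, 3]},
    [("Q", 1, "P", 1), ("P", 3, "Q", 2)],
)

# Checkpoints taken so that no message crosses them: nothing is rolled back.
coordinated = recovery_line(
    {"P": [0, 5], "Q": [0, 5]},
    [("P", 1, "Q", 2)],
)
```

In this simplified model, checkpoints older than the computed recovery line are never needed again, which is the general idea behind collecting garbage checkpoints; how the paper handles this across subsystems is not stated in the abstract.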