Application checkpointing in grid environment with improved checkpoint reliability through replication
R. K. Bawa, R. Singh
2012 Third International Conference on Computing, Communication and Networking Technologies (ICCCNT'12), 2012-07-26
DOI: 10.1109/ICCCNT.2012.6395974 (https://doi.org/10.1109/ICCCNT.2012.6395974)
Grid technologies are emerging as the next generation of distributed computing, allowing the aggregation of heterogeneous, geographically distributed resources. The heterogeneous nature of the grid makes it more vulnerable to faults, which lead either to the failure of a job or to a delay in completing its execution. Checkpointing is one of the many fault tolerance techniques used to make the Grid more efficient and reliable. In this paper we develop an application-checkpointing-based fault tolerance technique for an Alchemi-based Grid environment. In this technique, application threads generate their own checkpoints and store them in the checkpoint table at the manager node. If a thread fails, the checkpoint of that thread is used to resume execution from the point of failure. The technique introduces a slight overhead in fault-free situations but is very effective in the case of a node failure. Increasing the checkpoint frequency improves the job's ability to resume, but it also increases the overhead of generating and storing checkpoints, which increases the processing time of the job.
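The mechanism described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not use the Alchemi API: `CheckpointTable`, `run_thread`, `interval`, and `fail_at` are hypothetical names introduced here, the table is an in-memory dictionary standing in for the manager node's checkpoint table, and the "thread" is a simple summation loop. In this sketch a restarted thread resumes from its most recent stored checkpoint.

```python
from typing import Dict, Optional


class CheckpointTable:
    """Hypothetical in-memory stand-in for the manager node's checkpoint table."""

    def __init__(self) -> None:
        self._latest: Dict[int, dict] = {}
        self.writes = 0  # number of checkpoints stored (a proxy for overhead)

    def store(self, thread_id: int, state: dict) -> None:
        # Keep a copy so later changes in the thread do not alter the checkpoint.
        self._latest[thread_id] = dict(state)
        self.writes += 1

    def load(self, thread_id: int) -> Optional[dict]:
        saved = self._latest.get(thread_id)
        return dict(saved) if saved is not None else None


def run_thread(thread_id: int, n: int, table: CheckpointTable,
               interval: int, fail_at: Optional[int] = None) -> int:
    """Sum the integers 0..n-1, checkpointing every `interval` iterations.

    If `fail_at` is given, a RuntimeError is raised when the loop reaches
    that iteration, simulating a node failure (assumption for illustration).
    """
    # Resume from the stored checkpoint if one exists, else start fresh.
    state = table.load(thread_id) or {"i": 0, "total": 0}
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % interval == 0:
            table.store(thread_id, state)
    return state["total"]


if __name__ == "__main__":
    table = CheckpointTable()
    try:
        run_thread(1, 100, table, interval=10, fail_at=57)
    except RuntimeError:
        pass  # the node running thread 1 has failed
    # Rerun the same thread: it picks up from its last checkpoint.
    print(run_thread(1, 100, table, interval=10))
    print(table.writes)
```

With `interval=10` and a failure at iteration 57, the rerun repeats only the iterations after the checkpoint taken at iteration 50. A smaller `interval` reduces the repeated work but raises `table.writes`, which mirrors the trade-off between checkpoint frequency and overhead noted in the abstract.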