Efficient checkpoint algorithm for distributed system

International Journal of Engineering in Computer Science. Pub Date: 2019-07-01. DOI: 10.33545/26633582.2019.v1.i2a.22

N. Rathore, Jyoti Rathore


Citations: 4

Abstract



The Grid is rapidly emerging as the means for coordinated resource sharing and problem solving in multi-institutional virtual organizations, while providing dependable, consistent, and pervasive access to global resources. The emergence of computational Grids, and the potential for seamless aggregation and interaction between distributed services and resources, has led to the start of a new era of computing. The tremendously large number and heterogeneous nature of Grid computing resources make resource management a significantly challenging job. Resource management scenarios often include resource discovery, resource monitoring, resource inventories, resource provisioning, fault isolation, a variety of autonomic capabilities, and service-level management activities. Of these, fault tolerance has become the main topic of research, as to date there is no single system that can be called a complete system able to handle all faults in Grids. Checkpointing is one of the fault-tolerance techniques used to recover from faults and restart jobs quickly. Checkpointing algorithms for distributed systems have been under study for years. These algorithms can be classified into three classes: coordinated, uncoordinated, and communication-induced. In this paper, a checkpointing algorithm is proposed whose checkpoint count is minimal, equivalent to that of a periodic checkpointing algorithm. It has a relatively short rollback distance in faulty situations and performs better than other algorithms in terms of task completion time, in both fault-free and faulty situations. This algorithm has been implemented in Alchemi.NET because it did not currently support any
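The abstract compares against periodic checkpointing and measures rollback distance, so a small illustration of those two notions may help. The sketch below is not the algorithm proposed in the paper (the abstract does not describe it in detail) and it does not use Alchemi.NET. It is a single-process toy in Python, and the function name `run_with_periodic_checkpoints` and the parameters `total_steps`, `interval`, and `fail_at` are hypothetical names chosen for this example.

```python
import pickle


def run_with_periodic_checkpoints(total_steps, interval, fail_at=None):
    """Run a toy task, saving a checkpoint every `interval` completed steps.

    Assumption for this sketch: at most one fault is injected, just before
    step `fail_at` executes. Returns (final_state, checkpoints_taken,
    steps_redone), where steps_redone is the rollback distance in steps.
    """
    state = {"step": 0, "acc": 0}
    saved = pickle.dumps(state)  # initial checkpoint, kept in memory here
    checkpoints_taken = 0
    steps_redone = 0
    failed = False

    while state["step"] < total_steps:
        if fail_at is not None and not failed and state["step"] == fail_at:
            # Simulated fault: discard live state, restore the last checkpoint.
            failed = True
            lost_at = state["step"]
            state = pickle.loads(saved)
            steps_redone += lost_at - state["step"]
            continue

        # One unit of "work": accumulate the step index.
        state["acc"] += state["step"]
        state["step"] += 1

        # Periodic checkpoint: snapshot after every `interval` completed steps.
        if state["step"] % interval == 0:
            saved = pickle.dumps(state)
            checkpoints_taken += 1

    return state, checkpoints_taken, steps_redone


if __name__ == "__main__":
    print(run_with_periodic_checkpoints(10, 4))
    print(run_with_periodic_checkpoints(10, 4, fail_at=7))
```

In this toy, a shorter `interval` lowers the rollback distance after a fault but raises the number of checkpoints taken in the fault-free case, which is the trade-off the abstract refers to when it speaks of checkpoint counts, rollback distance, and task completion time.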
