Efficient checkpoint algorithm for distributed system

N. Rathore, Jyoti Rathore
Journal: International Journal of Engineering in Computer Science
Published: 2019-07-01
DOI: https://doi.org/10.33545/26633582.2019.v1.i2a.22
Citations: 4

Abstract

The Grid is rapidly emerging as the means for coordinated resource sharing and problem solving in multi-institutional virtual organizations, while providing dependable, consistent, pervasive access to global resources. The emergence of computational Grids, and the potential for seamless aggregation and interaction between distributed services and resources, has led to the start of a new era of computing. The tremendously large number and heterogeneous nature of Grid computing resources make resource management a significantly challenging job. Resource management scenarios often include resource discovery, resource monitoring, resource inventories, resource provisioning, fault isolation, a variety of autonomic capabilities, and service-level management activities. Of these, fault tolerance has become the main topic of research, as to date no single system can be called a complete system that handles all the faults in grids. Checkpointing is one of the fault-tolerance techniques used to recover from faults and restart jobs quickly. Algorithms for checkpointing on distributed systems have been under study for years. These algorithms can be classified into three classes: coordinated, uncoordinated, and communication-induced. This paper proposes a checkpointing algorithm whose checkpoint count is minimal, equivalent to that of the periodic checkpointing algorithm. It has a relatively short rollback distance in faulty situations and achieves better task completion time than other algorithms in both fault-free and faulty situations. The algorithm has been implemented in Alchemi.NET because it did not currently support any
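
The three quantities the abstract compares (checkpoint count, rollback distance, and task completion time) can be illustrated with a minimal sketch of plain periodic checkpointing, the baseline the abstract mentions. This is not the authors' algorithm and is unrelated to the Alchemi.NET implementation. All function names and parameters are hypothetical, and the model assumes unit-sized work steps, a fixed checkpoint cost, and zero restart cost.

```python
from typing import Iterable


def checkpoint_count(total_work: int, interval: int) -> int:
    """Checkpoints taken in a fault-free run (none at the start or at completion)."""
    return (total_work - 1) // interval


def rollback_distance(fault_progress: int, interval: int) -> int:
    """Work units lost when a fault at `fault_progress` rolls back to the last checkpoint."""
    return fault_progress % interval


def completion_time(total_work: int, interval: int, checkpoint_cost: int,
                    faults: Iterable[int] = ()) -> int:
    """Simulate one task under periodic checkpointing (illustrative model only).

    `faults` lists progress values at which a single fault strikes; each fault
    rolls progress back to the most recent checkpoint.
    """
    pending = sorted(faults)
    clock = 0       # elapsed time, including checkpoint overhead and redone work
    progress = 0    # completed work units
    last_ckpt = 0   # progress saved by the most recent checkpoint
    while progress < total_work:
        progress += 1
        clock += 1
        # Take a periodic checkpoint every `interval` work units.
        if progress % interval == 0 and progress < total_work:
            clock += checkpoint_cost
            last_ckpt = progress
        # A fault discards everything done since the last checkpoint.
        if pending and progress == pending[0]:
            pending.pop(0)
            progress = last_ckpt
    return clock


if __name__ == "__main__":
    print(checkpoint_count(10, 4))           # 2 checkpoints (at progress 4 and 8)
    print(completion_time(10, 4, 1))         # 12 = 10 work units + 2 checkpoints
    print(rollback_distance(7, 4))           # 3 work units lost
    print(completion_time(10, 4, 1, [7]))    # 15 = 12 + 3 redone work units
```

In this toy model a shorter interval reduces the rollback distance but raises the checkpoint count, which is the trade-off behind the abstract's claim of minimal checkpoint counts together with short rollback distance.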