Dependable initialization of large-scale distributed software

J. Ren, R. Buskens, O. J. Gonzalez
International Conference on Dependable Systems and Networks, 2004
Published: 2004-06-28
DOI: 10.1109/DSN.2004.1311903 (https://doi.org/10.1109/DSN.2004.1311903)
Citations: 7

Abstract

Most documented efforts in fault-tolerant computing address the problem of recovering from failures that occur during normal system operation. To bring a system to a point where it can begin performing its duties first requires that the system successfully complete initialization. Large-scale distributed systems may take hours to initialize. For such systems, a key challenge is tolerating failures that occur during initialization, while still completing initialization in a timely manner. In this paper, we present a dependable initialization model that captures the architecture of the system to be initialized, as well as interdependencies among system components. We show that overall system initialization may sometimes complete more quickly if recovery actions are deferred as opposed to commencing recovery actions as soon as a failure is detected. This observation leads us to introduce a recovery decision function that dynamically assesses when to take recovery actions. We then describe a dependable initialization algorithm that combines the dependable initialization model and the recovery decision function for achieving fast initialization. Experimental results show that our algorithm incurs lower initialization overhead than that of a conventional initialization algorithm. This work is the first effort we are aware of that formally studies the challenges of initializing a distributed system in the presence of failures.
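The abstract does not define the recovery decision function, so the following is only a toy sketch of the idea that deferring recovery can shorten overall initialization. It assumes a hypothetical scenario that is not taken from the paper: the only recovery action is rebooting a shared processor, sibling components that are still initializing lose their progress on reboot, and siblings that have already finished only need a short reload afterwards. The names `Sibling`, `time_if_recover_now`, `time_if_deferred`, and `recovery_decision`, and all cost parameters, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sibling:
    """A healthy component sharing a processor with the failed one (hypothetical model)."""
    remaining: float  # time it still needs to finish initializing
    full_init: float  # time to initialize from scratch after a reboot
    reload: float     # time to come back after a reboot once it has finished


def time_if_recover_now(reboot: float, failed_init: float,
                        siblings: List[Sibling]) -> float:
    # Reboot immediately: in-progress siblings lose their work and start over.
    # All components re-initialize in parallel, so the slowest one dominates.
    return reboot + max([failed_init] + [s.full_init for s in siblings])


def time_if_deferred(reboot: float, failed_init: float,
                     siblings: List[Sibling]) -> float:
    # Wait until every sibling has finished, then reboot.
    # Finished siblings only need their short reload afterwards.
    wait = max((s.remaining for s in siblings), default=0.0)
    return wait + reboot + max([failed_init] + [s.reload for s in siblings])


def recovery_decision(reboot: float, failed_init: float,
                      siblings: List[Sibling]) -> Tuple[str, float]:
    """Pick the option with the smaller estimated completion time.

    Ties go to immediate recovery (an arbitrary choice for this sketch).
    """
    now = time_if_recover_now(reboot, failed_init, siblings)
    later = time_if_deferred(reboot, failed_init, siblings)
    return ("defer", later) if later < now else ("recover_now", now)


if __name__ == "__main__":
    # Sibling is almost done: waiting 10 avoids redoing 60 units of work.
    print(recovery_decision(5.0, 20.0, [Sibling(10.0, 60.0, 5.0)]))  # ('defer', 35.0)
    # Sibling has barely started: waiting costs more than restarting it.
    print(recovery_decision(5.0, 20.0, [Sibling(50.0, 60.0, 5.0)]))  # ('recover_now', 65.0)
```

In this toy model the decision depends on how far along the healthy siblings are when the failure is detected, which is why the estimate would have to be re-evaluated dynamically instead of being fixed in advance.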