Dependable initialization of large-scale distributed software

International Conference on Dependable Systems and Networks, 2004. Pub Date: 2004-06-28. DOI: 10.1109/DSN.2004.1311903

J. Ren, R. Buskens, O. J. Gonzalez


Citations: 7

Abstract


Most documented efforts in fault-tolerant computing address the problem of recovering from failures that occur during normal system operation. To bring a system to a point where it can begin performing its duties first requires that the system successfully complete initialization. Large-scale distributed systems may take hours to initialize. For such systems, a key challenge is tolerating failures that occur during initialization, while still completing initialization in a timely manner. In this paper, we present a dependable initialization model that captures the architecture of the system to be initialized, as well as interdependencies among system components. We show that overall system initialization may sometimes complete more quickly if recovery actions are deferred as opposed to commencing recovery actions as soon as a failure is detected. This observation leads us to introduce a recovery decision function that dynamically assesses when to take recovery actions. We then describe a dependable initialization algorithm that combines the dependable initialization model and the recovery decision function for achieving fast initialization. Experimental results show that our algorithm incurs lower initialization overhead than that of a conventional initialization algorithm. This work is the first effort we are aware of that formally studies the challenges of initializing a distributed system in the presence of failures.
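The observation that deferring recovery can shorten overall initialization can be illustrated with a toy model. The sketch below is not the paper's recovery decision function, which the abstract does not specify. It assumes, purely for illustration, that a failed component is recovered by restarting its host process, that the restart discards the progress of co-located components still initializing, and that components which have already finished survive the restart. All names (`Sibling`, `decide_recovery`, `restart_cost`, and so on) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sibling:
    """A co-located component that is still initializing (hypothetical model)."""
    name: str
    total_time: float      # initialization time when started from scratch
    remaining_time: float  # time left until it finishes if left undisturbed


def finish_if_recover_now(restart_cost, failed_init_time, siblings):
    # Assumption: restarting immediately discards every unfinished sibling's
    # progress; all of them re-initialize from scratch, in parallel with the
    # failed component.
    longest = max([failed_init_time] + [s.total_time for s in siblings])
    return restart_cost + longest


def finish_if_deferred(restart_cost, failed_init_time, siblings):
    # Assumption: wait until every sibling has finished, then restart; only
    # the failed component has to initialize again.
    wait = max([s.remaining_time for s in siblings], default=0.0)
    return wait + restart_cost + failed_init_time


def decide_recovery(restart_cost, failed_init_time, siblings):
    """Return (action, estimated_finish_time), measured from failure detection."""
    now = finish_if_recover_now(restart_cost, failed_init_time, siblings)
    later = finish_if_deferred(restart_cost, failed_init_time, siblings)
    # Defer only when it is strictly faster; ties favour immediate recovery.
    return ("defer", later) if later < now else ("recover_now", now)


if __name__ == "__main__":
    # A long-running sibling that is almost done: restarting now would throw
    # away most of its work, so deferring wins in this model.
    print(decide_recovery(2.0, 10.0, [Sibling("db", 100.0, 5.0)]))
    # A sibling that has barely started: waiting for it costs more than
    # redoing its small amount of lost work.
    print(decide_recovery(2.0, 10.0, [Sibling("cache", 20.0, 18.0)]))
```

In this model deferral pays off only when a co-located component has accumulated much more progress than it would take to wait for it. A real decision function would also need to account for interdependencies among components, which this sketch ignores.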

