Using resource monitoring to select recovery strategies

Annual Symposium Reliability and Maintainability, 2004 - RAMS. Pub Date: 2004-08-24. DOI: 10.1109/RAMS.2004.1285459

R. Tirtea, Geert Deconinck, V. De Florio, R. Belmans


Citations: 8

Abstract


Using resource monitoring to select recovery strategies

Distributed heterogeneous embedded systems involved in the control of infrastructures, such as the electric power infrastructure, must deliver reliable services despite faults and changes in the environment. A fault-tolerance middleware architecture containing mechanisms for adapting quality of service (QoS) has been developed to assure dependable control of the infrastructure's components. Recovery strategies allow the system to be reconfigured (e.g. graceful degradation) based on the circumstances of the failure. In this paper we present why and how available resources should also be considered, together with the type and circumstances of the failure, when selecting a recovery strategy. Changes in the environment, such as reduced resources at node level (e.g. system overload) or degraded QoS (e.g. scarce bandwidth on communication links), should be considered before allocating a new process or task to another host and before taking reconfiguration decisions. A mathematical model for generating a composite indicator from sampled parameters is presented. The mechanism for monitoring resources at node level is described, along with how it can be used to select a recovery action (e.g. restarting processes on, or migrating them to, overloaded nodes should be avoided). Different recovery strategies can also be selected based on the communication characteristics between distributed sites (e.g. availability or cost). We consider the case of two recovery strategies and present a mechanism for selecting the appropriate one. The fault-tolerant architecture integrating the QoS monitoring mechanism achieves dynamic reconfiguration of the recovery strategies based on changes in the environment. The QoS monitoring mechanism also improves the differentiation between node crashes and network problems for nodes suspected of failure. Another advantage of this mechanism is the dynamic adaptation of resource allocation, which increases overall application availability.
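The abstract does not give the paper's actual mathematical model, so the following is only a minimal sketch of the general idea: combine sampled node parameters into one composite indicator, then use it to choose between restarting a failed process locally and migrating it to another host. The weights, the threshold, the parameter set (CPU, memory, link utilisation), the names `NodeSamples`, `composite_indicator` and `select_recovery`, and the `"degrade"` fallback are all assumptions made for illustration, not the authors' method.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import fmean
from typing import Optional


@dataclass
class NodeSamples:
    """Recent utilisation samples for one node, each value in [0.0, 1.0]."""
    cpu: list[float]
    memory: list[float]
    link: list[float]  # fraction of link bandwidth in use


# Hypothetical weights and threshold: the abstract gives no concrete values.
WEIGHTS = {"cpu": 0.5, "memory": 0.3, "link": 0.2}
OVERLOAD_THRESHOLD = 0.8


def composite_indicator(samples: NodeSamples) -> float:
    """Weighted mean of the per-parameter sample averages (0 = idle, 1 = saturated)."""
    return (
        WEIGHTS["cpu"] * fmean(samples.cpu)
        + WEIGHTS["memory"] * fmean(samples.memory)
        + WEIGHTS["link"] * fmean(samples.link)
    )


def select_recovery(
    local: NodeSamples, candidates: dict[str, NodeSamples]
) -> tuple[str, Optional[str]]:
    """Return (strategy, target host).

    Restart locally if the local node is not overloaded; otherwise migrate to
    the least-loaded candidate that is below the threshold; otherwise fall
    back to graceful degradation (an assumed third outcome).
    """
    if composite_indicator(local) < OVERLOAD_THRESHOLD:
        return ("restart", None)

    # Least-loaded candidate host, or None when there are no candidates.
    best = min(
        candidates.items(),
        key=lambda item: composite_indicator(item[1]),
        default=None,
    )
    if best is not None and composite_indicator(best[1]) < OVERLOAD_THRESHOLD:
        return ("migrate", best[0])
    return ("degrade", None)
```

A caller would feed `select_recovery` the latest samples collected by the node-level monitor. Averaging over a window of samples rather than using a single reading is one simple way to keep a momentary spike from triggering a migration.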

