Proactive fault-management in software systems

Proceedings 33rd Annual Simulation Symposium (SS 2000). Pub Date: 2000-04-16. DOI: 10.1109/SIMSYM.2000.844894

Kishor S. Trivedi



Abstract

Hardware redundancy is a time-honored technique to enhance reliability. However, when applied to software systems, it is inherently expensive to implement due to the need to employ design diversity. Furthermore, recent studies have reported the transient nature of software failures, for which design diversity is not very helpful. Transient failures typically occur because of design faults in software, which result in unacceptable erroneous states in the OS environment of the process. Hence, environment diversity, a generalization of system restart, has been proposed as a cheap yet effective technique for software fault tolerance. The basic idea here is to modify the operating environment of the running process.

Recently, the phenomenon of "software aging", in which the state of the software system degrades with time, has been reported. The primary causes of this degradation are the exhaustion of operating system resources, data corruption, and numerical error accumulation. This may eventually lead to performance degradation of the software, a crash/hang failure, or both. Software aging has been reported in widely used software such as Netscape and xrn. Aging in AT&T telecommunication software is known to have resulted in packet loss. Numerous other examples exist, in systems with high availability requirements and also in safety-critical systems. To counteract this phenomenon, a proactive approach to fault management, called "software rejuvenation", has been proposed. This essentially involves gracefully terminating an application or a system and restarting it in a clean internal state. This process removes the accumulated errors and frees up operating system resources. The preventive action can be done at optimal times (for example, when the load on the system is low) so that the overhead due to planned system downtime is minimal. This method therefore avoids unplanned and potentially expensive system outages due to software aging. A basic assumption here is that the overhead involved in the planned downtime and in performing the clean-up operation is considerably less than the cost incurred due to unplanned system outages.

In this talk, we will discuss methods of evaluating the effectiveness of proactive fault management in operational software systems and of determining optimal times to perform rejuvenation, for different scenarios. This is done by developing stochastic models that trade off the cost of unexpected failures due to software aging against the overhead of proactive fault management. We use a Markov regenerative process model with a subordinated non-homogeneous Markov chain. The stochastic models have both theoretical and practical value. Depending on the failure characteristics of the software and the preventive maintenance policies, the appropriate model can be used to obtain optimal rejuvenation intervals based on several criteria. The second half of the talk will deal with measurement-based models, which are constructed using workload and resource usage data collected from the UNIX operating system over a period of time. Methodologies based on statistics and Markov models are used to detect software aging and to estimate its effect on various system resources. The measurement-based models are the first steps towards predicting aging-related failures, and are intended to help the development of strategies for software rejuvenation triggered by actual measurements.
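
The trade-off between the cost of unplanned failures and the overhead of planned rejuvenation can be illustrated with a much simpler model than the Markov regenerative process mentioned in the abstract. The sketch below is not the author's model: it assumes a classical age-replacement policy, a Weibull-distributed time to failure (shape greater than 1 standing in for aging), and hypothetical cost values. Under those assumptions the long-run cost per unit time of rejuvenating every tau hours is (c_f * F(tau) + c_r * (1 - F(tau))) divided by the integral of the survival function from 0 to tau, and the interval is chosen by scanning candidate values.

```python
import math


def survival(t, shape, scale):
    """Weibull survival function R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def expected_cycle_length(tau, shape, scale, steps=1000):
    """Integral of R(t) over [0, tau], approximated with the trapezoidal rule."""
    h = tau / steps
    total = 0.5 * (survival(0.0, shape, scale) + survival(tau, shape, scale))
    for i in range(1, steps):
        total += survival(i * h, shape, scale)
    return total * h


def cost_rate(tau, shape, scale, cost_failure, cost_rejuvenation):
    """Long-run cost per unit time when rejuvenating at age tau (renewal-reward)."""
    p_fail = 1.0 - survival(tau, shape, scale)  # failure occurs before tau
    expected_cost = cost_failure * p_fail + cost_rejuvenation * (1.0 - p_fail)
    return expected_cost / expected_cycle_length(tau, shape, scale)


def best_interval(candidates, shape, scale, cost_failure, cost_rejuvenation):
    """Return the candidate interval with the lowest long-run cost rate."""
    return min(
        candidates,
        key=lambda tau: cost_rate(tau, shape, scale, cost_failure, cost_rejuvenation),
    )


# Hypothetical parameters, chosen only for illustration:
# shape 2.0 (hazard grows with age), scale 100 hours,
# an unplanned outage costs 10 units, a planned rejuvenation costs 1 unit.
tau_star = best_interval(range(1, 301), 2.0, 100.0, 10.0, 1.0)
print("lowest-cost rejuvenation interval (hours):", tau_star)
print("cost rate at that interval:", cost_rate(tau_star, 2.0, 100.0, 10.0, 1.0))
```

With a shape of 1.0 the Weibull reduces to an exponential distribution, meaning the software does not age; in that case the cost rate in this simplified model keeps decreasing as the interval grows, which matches the intuition that rejuvenation only pays off when the failure rate increases with time.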


