Title: Proactive fault-management in software systems
Authors: Kishor S. Trivedi
Published in: Proceedings 33rd Annual Simulation Symposium (SS 2000), 2000-04-16
DOI: 10.1109/SIMSYM.2000.844894 (https://doi.org/10.1109/SIMSYM.2000.844894)
Citation count: 0
Abstract
Hardware redundancy is a time-honored technique for enhancing reliability. However, when applied to software systems, it is inherently expensive to implement because of the need to employ design diversity. Furthermore, recent studies have reported the transient nature of software failures, for which design diversity is not very helpful. Transient failures typically occur because of design faults in software, which result in unacceptable erroneous states in the OS environment of the process. Hence, environment diversity, a generalization of system restart, has been proposed as a cheap yet effective technique for software fault tolerance. The basic idea is to modify the operating environment of the running process. Recently, the phenomenon of "software aging", in which the state of a software system degrades with time, has been reported. The primary causes of this degradation are the exhaustion of operating system resources, data corruption, and numerical error accumulation. This may eventually lead to performance degradation of the software, crash/hang failure, or both. Software aging has been reported in widely used software such as Netscape and xrn. Aging in AT&T telecommunication software is known to have resulted in packet loss. Numerous other examples exist, both in systems with high availability requirements and in safety-critical systems. To counteract this phenomenon, a proactive approach to fault management, called "software rejuvenation", has been proposed. This essentially involves gracefully terminating an application or a system and restarting it in a clean internal state. This process removes the accumulated errors and frees up operating system resources. The preventive action can be performed at optimal times (for example, when the load on the system is low) so that the overhead due to planned system downtime is minimal. This method therefore avoids unplanned and potentially expensive system outages due to software aging.
A basic assumption here is that the overhead involved in the planned downtime and in performing the clean-up operation is considerably less than the cost incurred due to unplanned system outages. In this talk, we will discuss methods of evaluating the effectiveness of proactive fault management in operational software systems and of determining optimal times to perform rejuvenation for different scenarios. This is done by developing stochastic models that trade off the cost of unexpected failures due to software aging against the overhead of proactive fault management. We use a Markov regenerative process model with a subordinated non-homogeneous Markov chain. The stochastic models have both theoretical and practical value. Depending on the failure characteristics of the software and the preventive maintenance policies, the appropriate model can be used to obtain optimal rejuvenation intervals based on several criteria. The second half of the talk will deal with measurement-based models, which are constructed using workload and resource usage data collected from the UNIX operating system over a period of time. Methodologies based on statistics and Markov models are used to detect software aging and to estimate its effect on various system resources. The measurement-based models are a first step towards predicting aging-related failures and are intended to help develop strategies for software rejuvenation triggered by actual measurements.
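The cost trade-off described above can be illustrated with a minimal sketch. This is not the Markov regenerative process model used in the talk; it swaps in the much simpler classical age-replacement formulation, in which the system is rejuvenated after a fixed interval unless it fails first. The Weibull time-to-failure distribution, the cost values, and all function and parameter names are assumptions chosen purely for illustration, and downtime durations are ignored.

```python
import math


def weibull_survival(t, shape, scale):
    """Probability that the software has not failed by time t (assumed Weibull)."""
    return math.exp(-((t / scale) ** shape))


def cost_rate(interval, shape, scale, cost_failure, cost_rejuvenation, steps=2000):
    """Long-run expected cost per unit time when rejuvenating every `interval`.

    Renewal-reward argument: expected cost of one cycle divided by the
    expected length of one cycle (a cycle ends at failure or rejuvenation).
    """
    # Expected cycle length = integral of the survival function over
    # [0, interval], approximated here with the trapezoidal rule.
    h = interval / steps
    total = 0.5 * (1.0 + weibull_survival(interval, shape, scale))
    for i in range(1, steps):
        total += weibull_survival(i * h, shape, scale)
    expected_cycle = total * h

    # A cycle ends in an unplanned failure with probability p_fail,
    # otherwise in a planned rejuvenation.
    p_fail = 1.0 - weibull_survival(interval, shape, scale)
    expected_cost = cost_failure * p_fail + cost_rejuvenation * (1.0 - p_fail)
    return expected_cost / expected_cycle


def best_interval(candidates, **params):
    """Candidate interval with the lowest long-run cost rate (grid search)."""
    return min(candidates, key=lambda t: cost_rate(t, **params))


# Hypothetical parameters: shape > 1 models aging (increasing failure rate),
# and an unplanned outage is assumed to cost ten times a planned rejuvenation.
params = dict(shape=2.0, scale=100.0, cost_failure=10.0, cost_rejuvenation=1.0)
best = best_interval(range(1, 301), **params)

if __name__ == "__main__":
    print("best interval:", best, "cost rate:", cost_rate(best, **params))
```

Under these assumptions, an interior optimum exists only because the failure rate increases with time and rejuvenation is cheaper than an outage. With `shape=1.0` (no aging), the cost rate keeps falling as the interval grows, so the search simply returns the longest candidate, which matches the intuition that rejuvenation pays off only for systems that actually age.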