Yennun Huang, C. Kintala, L. Bernstein, Yi-Min Wang
Components for software fault tolerance and rejuvenation
AT&T Tech. J., published 1996-03-01. DOI: 10.15325/ATTTJ.1996.6771126 (https://doi.org/10.15325/ATTTJ.1996.6771126)
Citations: 20
Abstract
Software fault tolerance is the task of detecting and recovering from failures that are not handled in the underlying hardware or operating system layers of an application. Software rejuvenation prevents failures by periodically, and gracefully, terminating an application and restarting it at a clean internal state. This paper describes five reusable software components that provide these capabilities. They perform automatic detection and restart of failed processes, checkpointing and recovery of data in memory, replication and synchronization of files, and software rejuvenation. These components, which have been ported to a number of UNIX platforms, can be used in any application with minimal programming effort. The fault tolerance capabilities of several communication products and services at AT&T have been enhanced by incorporating these components. Experience with these products to date indicates that the components provide efficient, economical means to increase the level of fault tolerance in an application.
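
To make two of the capabilities named in the abstract concrete (automatic detection and restart of failed processes, and software rejuvenation), here is a minimal sketch in Python. It is not the paper's implementation and does not reproduce any of the five components. The function name `supervise`, its parameters, and the event labels are all hypothetical and chosen only for illustration. The sketch assumes a failure shows up as a nonzero exit status and that a graceful termination request (`terminate()`, which sends SIGTERM on POSIX) is enough to stop the application.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, rejuvenate_after=None, poll_interval=0.05):
    """Run cmd under supervision and return a list of event labels.

    Hypothetical illustration only:
    - a nonzero exit status is treated as a failure and triggers a restart;
    - if rejuvenate_after (seconds) is set, a still-running process is
      terminated and restarted once it has run that long;
    - at most max_restarts restarts are attempted before giving up.
    """
    events = []
    restarts = 0
    while True:
        proc = subprocess.Popen(cmd)
        started = time.monotonic()
        events.append("start")
        rejuvenated = False

        # Poll the child until it exits or reaches its rejuvenation age.
        while proc.poll() is None:
            age = time.monotonic() - started
            if rejuvenate_after is not None and age >= rejuvenate_after:
                proc.terminate()  # graceful stop request
                proc.wait()
                rejuvenated = True
                break
            time.sleep(poll_interval)

        if rejuvenated:
            events.append("rejuvenate")
        elif proc.returncode == 0:
            events.append("exit-ok")
            return events
        else:
            events.append("failure")

        if restarts >= max_restarts:
            events.append("give-up")
            return events
        restarts += 1


if __name__ == "__main__":
    # A child that always fails, to exercise the restart path.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(failing, max_restarts=2))
```

The sketch polls the child rather than waiting on it so that a single loop can check both for an exit and for the rejuvenation age. It leaves out checkpointing of in-memory data and file replication, so a restarted process here begins with no saved state.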