Components for software fault tolerance and rejuvenation

Yennun Huang, C. Kintala, L. Bernstein, Yi-Min Wang
AT&T Tech. J., published 1996-03-01. DOI: 10.15325/ATTTJ.1996.6771126 (https://doi.org/10.15325/ATTTJ.1996.6771126)
Citations: 20

Abstract

Software fault tolerance is the task of detecting and recovering from failures that are not handled in the underlying hardware or operating system layers of an application. Software rejuvenation prevents failures by periodically, and gracefully, terminating an application and restarting it at a clean internal state. This paper describes five reusable software components that provide these capabilities. They perform automatic detection and restart of failed processes, checkpointing and recovery of data in memory, replication and synchronization of files, and software rejuvenation. These components, which have been ported to a number of UNIX platforms, can be used in any application with minimal programming effort. The fault tolerance capabilities of several communication products and services in AT&T have been enhanced by incorporating these components. Experience with these products to date indicates that the components provide efficient, economical means to increase the level of fault tolerance in an application.
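
To make two of the capabilities named in the abstract concrete (detection and restart of a failed process, and periodic rejuvenation), here is a minimal Python sketch. It is not the paper's components or their interface, which the abstract does not describe. The function name `supervise`, its parameters, and the event labels are hypothetical, chosen only for illustration. For simplicity, the sketch counts a rejuvenation as a restart against the same limit.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, rejuvenate_after=None, poll_interval=0.05):
    """Run cmd as a child process and restart it when it fails.

    Hypothetical illustration only. If rejuvenate_after (seconds) is set,
    a child that has run that long is asked to terminate and is restarted
    from a clean state. Returns the list of events that occurred.
    """
    events = []
    restarts = 0
    while True:
        proc = subprocess.Popen(cmd)
        events.append("start")
        started = time.monotonic()
        rejuvenated = False
        # Poll until the child exits or its rejuvenation deadline passes.
        while proc.poll() is None:
            age = time.monotonic() - started
            if rejuvenate_after is not None and age >= rejuvenate_after:
                proc.terminate()  # graceful stop request (SIGTERM on UNIX)
                proc.wait()
                rejuvenated = True
                break
            time.sleep(poll_interval)
        if rejuvenated:
            events.append("rejuvenated")
        elif proc.returncode == 0:
            # Clean exit: nothing to recover from, so stop supervising.
            events.append("exited")
            return events
        else:
            events.append("failed")
        # Both failures and rejuvenations count against the restart limit.
        restarts += 1
        if restarts > max_restarts:
            events.append("gave_up")
            return events


if __name__ == "__main__":
    # A child that always fails is restarted once, then abandoned.
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    print(supervise(failing, max_restarts=1))
```

A production supervisor would also need to detect hung processes (for example through heartbeats) and restore checkpointed state after a restart; this sketch covers only exit-based detection and a time-based rejuvenation trigger.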