Application-transparent process-level error recovery for multicomputers

Y. Tamir, T. Frazier
Published in: [1989] Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences. Volume 1: Architecture Track
Publication date: 1989-01-03
DOI: 10.1109/HICSS.1989.47170 (https://doi.org/10.1109/HICSS.1989.47170)
Citations: 18

Abstract

An application-transparent, process-level, distributed error recovery scheme for multicomputers is proposed. Checkpointing is initiated by timers at intervals determined by the needs of the application. Checkpointing and recovery involve only as much of the system as is necessary: a set of interacting processes. Processes that are not part of the interacting set do not participate in checkpointing or recovery and continue to do useful work. Several checkpoint and/or recovery sessions may be active simultaneously. The scheme does not require significant overhead during normal operation, since it is not necessary to make message transmission atomic, acknowledge each message, or transmit checkbits with each packet. Variations of the technique using packet-switching or virtual circuits are discussed, and the scheme is compared to previously published techniques.
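
The abstract does not say how the set of interacting processes is determined. As a rough illustration only, the sketch below assumes that the set is the transitive closure of message exchanges recorded since the last checkpoint, starting from the process whose timer fired. The function name `interacting_set` and the `(sender, receiver)` log format are hypothetical and are not taken from the paper.

```python
from collections import defaultdict, deque


def interacting_set(initiator, messages):
    """Return the processes reachable from `initiator` through message exchanges.

    `messages` is an iterable of (sender, receiver) pairs recorded since the
    last checkpoint (a hypothetical log format, not taken from the paper).
    """
    # Build an undirected communication graph: an exchange ties both ends together.
    neighbours = defaultdict(set)
    for sender, receiver in messages:
        neighbours[sender].add(receiver)
        neighbours[receiver].add(sender)

    # Breadth-first search outward from the initiating process.
    seen = {initiator}
    queue = deque([initiator])
    while queue:
        proc = queue.popleft()
        for peer in neighbours[proc]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


if __name__ == "__main__":
    log = [("A", "B"), ("B", "C"), ("D", "E")]
    # Processes that would checkpoint together with A under this assumption.
    print(sorted(interacting_set("A", log)))
    # D and E form a separate set, so their session could run independently.
    print(sorted(interacting_set("D", log)))
```

Under this assumption, processes outside the returned set are untouched by the session, which matches the abstract's claim that non-interacting processes continue to do useful work and that several sessions may be active at once.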