Handling Crash and Software Faults Efficiently in Distributed Event Stream Processing

Andrey Brito, Stefan Weigert, Martin Süßkraut, C. Fetzer, P. Felber
2010 Third International Conference on Dependability, published 2010-07-18. DOI: 10.1109/DEPEND.2010.32 (https://doi.org/10.1109/DEPEND.2010.32)
Citations: 4

Abstract

Active replication is a common approach to handling failures in distributed systems, including Event Stream Processing (ESP) systems. However, one weakness of conventional active replication is that replicas, being identical and in the same state, are susceptible to common-mode crashes caused by software bugs. We propose a new approach to active replication that assumes a failure model stronger than fail-stop but weaker than models permitting arbitrary failures. We combine transactional memory and extended runtime checking to achieve (i) low processing latency in failure-free runs, by allowing downstream nodes to use speculative results and thus circumvent the overhead added by the extended runtime checks, and (ii) a reduced MTTR, by enabling localized rollbacks (with word granularity) in several cases. We show that major limitations of n-variant active replication (e.g., multi-threading support, complex and slow recovery) can be overcome, and that tolerance to software bugs is orthogonal to Byzantine fault tolerance.
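The interplay of speculative output, runtime checking, and localized rollback described above can be illustrated with a toy sketch. This is not the authors' implementation: the paper relies on transactional memory with word-granularity rollback, whereas the sketch below substitutes a per-key undo log over a Python dictionary. The class `SpeculativeOperator`, its `check` callback, and the `"speculative"`/`"final"` labels are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

_MISSING = object()  # sentinel: the key did not exist before the update


@dataclass
class SpeculativeOperator:
    """Toy stateful operator with a per-key undo log (all names hypothetical)."""
    check: Callable[[Dict[str, int]], bool]  # stands in for the extended runtime check
    state: Dict[str, int] = field(default_factory=dict)
    emitted: List[Tuple[str, int, str]] = field(default_factory=list)

    def process(self, key: str, delta: int) -> bool:
        undo: Dict[str, Any] = {}  # undo log covering this event only
        # Record the old value before the first write to each location.
        undo.setdefault(key, self.state.get(key, _MISSING))
        self.state[key] = self.state.get(key, 0) + delta
        # Emit the result downstream right away, marked as speculative.
        self.emitted.append((key, self.state[key], "speculative"))
        # Run the (potentially slow) check only after the output has gone out.
        if self.check(self.state):
            k, v, _ = self.emitted[-1]
            self.emitted[-1] = (k, v, "final")  # commit the speculative output
            return True
        # Check failed: restore only the touched locations and retract the output.
        for k, old in undo.items():
            if old is _MISSING:
                del self.state[k]
            else:
                self.state[k] = old
        self.emitted.pop()
        return False


# Usage: the check rejects any state that contains a negative value.
op = SpeculativeOperator(check=lambda s: all(v >= 0 for v in s.values()))
op.process("a", 5)   # passes the check and is committed
op.process("b", -3)  # fails the check and is rolled back
op.process("a", 2)   # passes the check and is committed
```

In this sketch a failed check undoes only the entries written while handling the offending event, so the rest of the operator state is left untouched. That is the sense in which the rollback is localized; a real system would additionally have to notify downstream nodes that had already consumed the retracted speculative result.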