Examining Raft's behaviour during partial network failures

Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems
Pub Date: 2021-04-26
DOI: 10.1145/3447851.3458739

C. Jensen, H. Howard, R. Mortier

Citations: 4

Abstract

State machine replication protocols such as Raft are widely used to build highly available, strongly consistent services that remain live even if a minority of servers crash. As these systems are implemented and optimised for production, they accumulate many divergences from the original specification. These divergences are poorly documented, leaving operators with an incomplete model of the system's characteristics, especially during failures. In this paper, we look at one such Raft model, used to explain the November Cloudflare outage, and show that etcd's behaviour during the same failure differs. We then identify the specific optimisations in etcd that cause this difference and present a more complete model of the outage, based on etcd's behaviour in an emulated deployment using reckon. Finally, we highlight the upcoming PreVote optimisation in etcd, which might have prevented the outage from happening in the first place.
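To make the PreVote idea concrete, here is a minimal Python sketch of the general technique: a would-be candidate first asks its reachable peers whether they would vote for it, and only increments its term if a majority (counting itself) agrees. This is an illustration under stated assumptions, not etcd's implementation and not the model from the paper. The `Node` class, the `try_election` helper, the `heard_from_leader` flag, and the index-only log comparison are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical, heavily simplified Raft node (illustration only)."""
    name: str
    term: int = 1
    last_log_index: int = 0
    # Assumption: True if a heartbeat from a valid leader arrived
    # within the last election timeout.
    heard_from_leader: bool = False

    def grants_pre_vote(self, cand_term: int, cand_last_index: int) -> bool:
        # Answering a pre-vote never changes this node's term or vote.
        if self.heard_from_leader:
            return False
        # Simplification: real Raft compares last log term, then index.
        return cand_term > self.term and cand_last_index >= self.last_log_index


def try_election(candidate: Node, reachable: List[Node],
                 cluster_size: int, pre_vote: bool) -> bool:
    """Return True if the candidate starts a real election (bumps its term)."""
    if pre_vote:
        # The candidate counts itself, then asks every peer it can reach.
        grants = 1 + sum(
            p.grants_pre_vote(candidate.term + 1, candidate.last_log_index)
            for p in reachable
        )
        if grants <= cluster_size // 2:
            return False  # no majority: the term stays unchanged
    candidate.term += 1
    return True


# Three-node cluster: C cannot reach the leader but can reach B,
# and B is still receiving the leader's heartbeats.
b = Node("B", heard_from_leader=True)

c_plain = Node("C")
for _ in range(3):  # three election timeouts without PreVote
    try_election(c_plain, [b], cluster_size=3, pre_vote=False)

c_prevote = Node("C")
for _ in range(3):  # the same three timeouts with PreVote
    try_election(c_prevote, [b], cluster_size=3, pre_vote=True)

print(c_plain.term, c_prevote.term)
```

In this toy scenario the node without PreVote raises its term on every timeout, while the node with PreVote keeps its original term because B refuses the pre-vote while it still hears from the leader. Term inflation by a partially connected node is the kind of disruption PreVote is meant to avoid.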

