Fault-Tolerance, Fast and Slow: Exploiting Failure Asynchrony in Distributed Systems

Proceedings of the -- USENIX Symposium on Operating Systems Design and Implementation (OSDI). Pub Date: 2018-10-08. DOI: 10.5555/3291168.3291197

R. Alagappan, Aishwarya Ganesan, Jing Liu, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

Citations: 13

Abstract

We introduce situation-aware updates and crash recovery (SAUCR), a new approach to performing replicated data updates in a distributed system. SAUCR adapts the update protocol to the current situation: with many nodes up, SAUCR buffers updates in memory; when failures arise, SAUCR flushes updates to disk. This situation-awareness enables SAUCR to achieve high performance while offering strong durability and availability guarantees. We implement a prototype of SAUCR in ZooKeeper. Through rigorous crash testing, we demonstrate that SAUCR significantly improves durability and availability compared to systems that always write only to memory. We also show that SAUCR's reliability improvements come at little or no cost: SAUCR's overheads are within 0%-9% of a purely memory-based system.
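The mode switch described in the abstract can be illustrated with a minimal sketch. This is not the authors' ZooKeeper implementation. The names `Mode`, `choose_mode`, and `Replica` are hypothetical, the "disk" is an in-memory list standing in for durable storage, and the threshold (strictly more than a bare majority counts as "many nodes up") is an assumption made for illustration, since the abstract does not state the exact condition.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    FAST = "fast"  # buffer updates in memory
    SLOW = "slow"  # flush updates to disk


def choose_mode(nodes_up: int, cluster_size: int) -> Mode:
    """Pick an update mode from the number of nodes currently up.

    Assumption: "many nodes up" means strictly more than a bare majority.
    """
    bare_majority = cluster_size // 2 + 1
    return Mode.FAST if nodes_up > bare_majority else Mode.SLOW


@dataclass
class Replica:
    cluster_size: int
    memory_buffer: list = field(default_factory=list)
    disk_log: list = field(default_factory=list)  # stand-in for durable storage

    def apply_update(self, update: str, nodes_up: int) -> Mode:
        mode = choose_mode(nodes_up, self.cluster_size)
        if mode is Mode.FAST:
            # Many nodes are up: keep the update in memory only.
            self.memory_buffer.append(update)
        else:
            # Failures have arisen: persist buffered updates, then the new one.
            self.disk_log.extend(self.memory_buffer)
            self.memory_buffer.clear()
            self.disk_log.append(update)
        return mode
```

In a five-node cluster under this assumed threshold, an update arriving while all five nodes are up stays in `memory_buffer`; once only three nodes remain, the next update moves the buffered entries and itself into `disk_log`.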
