SWIM: scalable weakly-consistent infection-style process group membership protocol

Abhinandan Das, Indranil Gupta, Ashish Motivala
{"title":"SWIM: scalable weakly-consistent infection-style process group membership protocol","authors":"Abhinandan Das, Indranil Gupta, Ashish Motivala","doi":"10.1109/DSN.2002.1028914","DOIUrl":null,"url":null,"abstract":"Several distributed peer-to-peer applications require weakly-consistent knowledge of process group membership information at all participating processes. SWIM is a generic software module that offers this service for large scale process groups. The SWIM effort is motivated by the unscalability of traditional heart-beating protocols, which either impose network loads that grow quadratically with group size, or compromise response times or false positive frequency w.r.t. detecting process crashes. This paper reports on the design, implementation and performance of the SWIM sub-system on a large cluster of commodity PCs. Unlike traditional heart beating protocols, SWIM separates the failure detection and membership update dissemination functionalities of the membership protocol. Processes are monitored through an efficient peer-to-peer periodic randomized probing protocol. Both the expected time to first detection of each process failure, and the expected message load per member do not vary with group size. Information about membership changes, such as process joins, drop-outs and failures, is propagated via piggybacking on ping messages and acknowledgments. This results in a robust and fast infection style (also epidemic or gossip-style) of dissemination. The rate of false failure detections in the SWIM system is reduced by modifying the protocol to allow group members to suspect a process before declaring it as failed - this allows the system to discover and rectify false failure detections. Finally, the protocol guarantees a deterministic time bound to detect failures. Experimental results from the SWIM prototype are presented. We discuss the extensibility of the design to a WAN-wide scale.","PeriodicalId":93807,"journal":{"name":"Proceedings. International Conference on Dependable Systems and Networks","volume":"6 1","pages":"303-312"},"PeriodicalIF":0.0000,"publicationDate":"2002-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"176","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. International Conference on Dependable Systems and Networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSN.2002.1028914","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 176

Abstract

Several distributed peer-to-peer applications require weakly-consistent knowledge of process group membership information at all participating processes. SWIM is a generic software module that offers this service for large scale process groups. The SWIM effort is motivated by the unscalability of traditional heart-beating protocols, which either impose network loads that grow quadratically with group size, or compromise response times or false positive frequency w.r.t. detecting process crashes. This paper reports on the design, implementation and performance of the SWIM sub-system on a large cluster of commodity PCs. Unlike traditional heart beating protocols, SWIM separates the failure detection and membership update dissemination functionalities of the membership protocol. Processes are monitored through an efficient peer-to-peer periodic randomized probing protocol. Both the expected time to first detection of each process failure, and the expected message load per member do not vary with group size. Information about membership changes, such as process joins, drop-outs and failures, is propagated via piggybacking on ping messages and acknowledgments. This results in a robust and fast infection style (also epidemic or gossip-style) of dissemination. The rate of false failure detections in the SWIM system is reduced by modifying the protocol to allow group members to suspect a process before declaring it as failed - this allows the system to discover and rectify false failure detections. Finally, the protocol guarantees a deterministic time bound to detect failures. Experimental results from the SWIM prototype are presented. We discuss the extensibility of the design to a WAN-wide scale.
SWIM:可伸缩的弱一致感染式进程组成员协议
几个分布式点对点应用程序需要所有参与过程的过程组成员信息的弱一致知识。SWIM是一个通用的软件模块,为大型过程组提供此服务。SWIM工作的动机是传统心跳协议的不可扩展性,它要么施加网络负载,使其随组大小呈二次增长,要么损害响应时间或误报频率,以检测进程崩溃。本文介绍了一个大型商用pc机集群上的SWIM子系统的设计、实现和性能。与传统的心跳协议不同,SWIM分离了成员协议的故障检测和成员更新传播功能。进程通过有效的点对点周期性随机探测协议进行监控。首次检测每个流程故障的预期时间和每个成员的预期消息负载都不随组大小而变化。有关成员关系更改的信息(如进程连接、退出和失败)通过附带ping消息和确认来传播。这导致了一种强大而快速的感染方式(也称为流行病或八卦式)传播。在SWIM系统中,通过修改协议,允许组成员在宣布一个进程失败之前怀疑它,从而降低了错误故障检测的率——这允许系统发现并纠正错误的故障检测。最后,该协议保证了检测故障的确定性时间范围。给出了SWIM原型机的实验结果。我们讨论了该设计在广域网范围内的可扩展性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信