An Analysis of Network-Partitioning Failures in Cloud Systems

Ahmed Alquraan, Hatem Takruri, Mohammed Alfatafta, S. Al-Kiswany
{"title":"An Analysis of Network-Partitioning Failures in Cloud Systems","authors":"Ahmed Alquraan, Hatem Takruri, Mohammed Alfatafta, S. Al-Kiswany","doi":"10.5555/3291168.3291173","DOIUrl":null,"url":null,"abstract":"We present a comprehensive study of 136 system failures attributed to network-partitioning faults from 25 widely used distributed systems. We found that the majority of the failures led to catastrophic effects, such as data loss, reappearance of deleted data, broken locks, and system crashes. The majority of the failures can easily manifest once a network partition occurs: They require little to no client input, can be triggered by isolating a single node, and are deterministic. However, the number of test cases that one must consider is extremely large. Fortunately, we identify ordering, timing, and network fault characteristics that significantly simplify testing. Furthermore, we found that a significant number of the failures are due to design flaws in core system mechanisms.We found that the majority of the failures could have been avoided by design reviews, and could have been discovered by testing with network-partitioning fault injection. We built NEAT, a testing framework that simplifies the coordination of multiple clients and can inject different types of network-partitioning faults. We used NEAT to test seven popular systems and found and reported 32 failures.","PeriodicalId":90294,"journal":{"name":"Proceedings of the -- USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Symposium on Operating Systems Design and Implementation","volume":"56 1","pages":"51-68"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"52","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the -- USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Symposium on Operating Systems Design and Implementation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5555/3291168.3291173","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 52

Abstract

We present a comprehensive study of 136 system failures attributed to network-partitioning faults from 25 widely used distributed systems. We found that the majority of the failures led to catastrophic effects, such as data loss, reappearance of deleted data, broken locks, and system crashes. The majority of the failures can easily manifest once a network partition occurs: They require little to no client input, can be triggered by isolating a single node, and are deterministic. However, the number of test cases that one must consider is extremely large. Fortunately, we identify ordering, timing, and network fault characteristics that significantly simplify testing. Furthermore, we found that a significant number of the failures are due to design flaws in core system mechanisms.We found that the majority of the failures could have been avoided by design reviews, and could have been discovered by testing with network-partitioning fault injection. We built NEAT, a testing framework that simplifies the coordination of multiple clients and can inject different types of network-partitioning faults. We used NEAT to test seven popular systems and found and reported 32 failures.
云系统中网络分区故障分析
我们提出了一个全面的研究136系统故障归因于网络分区故障从25个广泛使用的分布式系统。我们发现,大多数故障都会导致灾难性的影响,例如数据丢失、删除的数据重新出现、锁被打破和系统崩溃。一旦发生网络分区,大多数故障很容易出现:它们几乎不需要客户端输入,可以通过隔离单个节点触发,并且是确定性的。然而,必须考虑的测试用例数量是非常大的。幸运的是,我们确定了顺序、时序和网络故障特征,这些特征显著地简化了测试。此外,我们发现大量的失败是由于核心系统机制的设计缺陷。我们发现,通过设计审查可以避免大多数故障,并且可以通过使用网络分区故障注入进行测试来发现这些故障。我们构建了一个测试框架NEAT,它简化了多个客户端的协调,并可以注入不同类型的网络分区错误。我们使用NEAT测试了7个流行的系统,发现并报告了32个故障。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信