Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure

R. Govindan, Ina Minei, M. Kallahalla, B. Koley, Amin Vahdat
Published in: Proceedings of the 2016 ACM SIGCOMM Conference
Publication date: 2016-08-22
DOI: 10.1145/2934872.2934891 (https://doi.org/10.1145/2934872.2934891)
Citations: 205

Abstract

Maintaining the highest levels of availability for content providers is challenging in the face of scale, network evolution and complexity. Little, however, is known about failures large content providers are susceptible to, and what mechanisms they employ to ensure high availability. From a detailed analysis of over 100 high-impact failure events in a global-scale content provider encompassing several data centers and two WANs, we quantify several dimensions of availability failures. We find that failures are evenly distributed across different network types and planes, but that a large number of failures happen when a management operation is in progress within the network. We discuss some of these failures in detail, and also describe our design principles for high availability motivated by these failures, including using defense in depth, maintaining consistency across planes, failing open on large failures, carefully preventing and avoiding failures, and assessing root cause quickly. Our findings suggest that, as networks become more complicated, failures lurk everywhere, and, counter-intuitively, continuous incremental evolution of the network can, when applied together with our design principles, result in a more robust network.