A Large Scale Study of Data Center Network Reliability

Proceedings of the Internet Measurement Conference 2018. Pub Date: 2018-10-31. DOI: 10.1145/3278532.3278566

Justin Meza, Tianyin Xu, K. Veeraraghavan, O. Mutlu


Citations: 64

Abstract

The ability to tolerate, remediate, and recover from network incidents (caused by device failures and fiber cuts, for example) is critical for building and operating highly-available web services. Achieving fault tolerance and failure preparedness requires system architects, software developers, and site operators to have a deep understanding of network reliability at scale, along with its implications on the software systems that run in data centers. Unfortunately, little has been reported on the reliability characteristics of large scale data center network infrastructure, let alone its impact on the availability of services powered by software running on that network infrastructure. This paper fills the gap by presenting a large scale, longitudinal study of data center network reliability based on operational data collected from the production network infrastructure at Facebook, one of the largest web service providers in the world. Our study covers reliability characteristics of both intra and inter data center networks. For intra data center networks, we study seven years of operation data comprising thousands of network incidents across two different data center network designs, a cluster network design and a state-of-the-art fabric network design. For inter data center networks, we study eighteen months of recent repair tickets from the field to understand reliability of Wide Area Network (WAN) backbones. In contrast to prior work, we study the effects of network reliability on software systems, and how these reliability characteristics evolve over time. We discuss the implications of network reliability on the design, implementation, and operation of large scale data center systems and how it affects highly-available web services. We hope our study forms a foundation for understanding the reliability of large scale network infrastructure, and inspires new reliability solutions to network incidents.



