Surviving switch failures in cloud datacenters

IF 2.2, Zone 4 (Computer Science), JCR Q3 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Rachee Singh, Muqeet Mukhtar, Ashay Krishna, Aniruddha Parkhi, J. Padhye, D. Maltz
Journal: ACM Sigcomm Computer Communication Review, volume 12 1, pages 2-9
Published: 2021-05-10
DOI: 10.1145/3464994.3464996 (https://doi.org/10.1145/3464994.3464996)
Citations: 13

Abstract

Switch failures can hamper access to client services, cause link congestion and blackhole network traffic. In this study, we examine the nature of switch failures in the datacenters of a large commercial cloud provider through the lens of survival theory. We study a cohort of over 180,000 switches with a variety of hardware and software configurations and find that datacenter switches have a 98% likelihood of functioning uninterrupted for over 3 months since deployment in production. However, there is significant heterogeneity in switch survival rates with respect to their hardware and software: the switches of one vendor are twice as likely to fail compared to the others. We attribute the majority of switch failures to hardware impairments and unplanned power losses. We find that the in-house switch operating system, SONiC, boosts the survival likelihood of switches in datacenters by 1% by eliminating switch failures caused by software bugs in vendor switch OSes.
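
To make the "survival theory" framing concrete, here is a minimal sketch of the Kaplan-Meier estimator, a standard way to estimate the probability that a unit is still functioning after a given time when some units have not failed by the end of observation. The abstract does not say which estimator the authors use, so this is an illustration of the general technique rather than their method. The function name `kaplan_meier` and the ten-switch cohort are hypothetical and are not data from the paper.

```python
from collections import Counter


def kaplan_meier(durations, failed):
    """Return [(t, S(t))] at each distinct failure time.

    durations: days each switch was observed since deployment
    failed:    True if the switch failed at that time, False if it was
               still running when observation ended (right-censored)
    """
    failures = Counter(t for t, f in zip(durations, failed) if f)
    exits = Counter(durations)  # every unit leaving the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk switches surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Failed and censored switches both drop out of the risk set.
        at_risk -= exits[t]
    return curve


# Hypothetical toy cohort of 10 switches (illustrative values only).
durations = [30, 45, 60, 90, 90, 90, 120, 120, 150, 150]
failed = [True, False, True, False, False, True, False, False, False, False]
curve = kaplan_meier(durations, failed)
for day, prob in curve:
    print(f"day {day}: estimated survival {prob:.4f}")
```

Censored switches (those still running when observation stops) reduce the at-risk count without counting as failures, which is what lets a statement such as "98% likelihood of functioning uninterrupted for over 3 months" be estimated from a cohort whose members were deployed at different times.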
Source journal
ACM Sigcomm Computer Communication Review
Category: Engineering Technology - Computer Science: Information Systems
CiteScore: 6.90
Self-citation rate: 3.60%
Articles published: 20
Review time: 4-8 weeks
Journal description: Computer Communication Review (CCR) is an online publication of the ACM Special Interest Group on Data Communication (SIGCOMM) and publishes articles on topics within the SIG's field of interest. Technical papers accepted to CCR typically report on practical advances or the practical applications of theoretical advances. CCR serves as a forum for interesting and novel ideas at an early stage in their development. The focus is on timely dissemination of new ideas that may help trigger additional investigations. While innovation and timeliness are the major criteria for acceptance, technical robustness and readability will also be considered in the review process. We particularly encourage papers with early evaluation or feasibility studies.