PreFix: Switch Failure Prediction in Datacenter Networks

Shenglin Zhang, Y. Liu, Weibin Meng, Zhiling Luo, Jiahao Bu, Sen Yang, Peixian Liang, Dan Pei, Jun Xu, Yuzhi Zhang, Yu Chen, Hui Dong, Xianping Qu, Lei Song
{"title":"PreFix: Switch Failure Prediction in Datacenter Networks","authors":"Shenglin Zhang, Y. Liu, Weibin Meng, Zhiling Luo, Jiahao Bu, Sen Yang, Peixian Liang, Dan Pei, Jun Xu, Yuzhi Zhang, Yu Chen, Hui Dong, Xianping Qu, Lei Song","doi":"10.1145/3219617.3219643","DOIUrl":null,"url":null,"abstract":"In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, thus the operators can intervene and \"fix\" the potential failures before they happen. Specifically, in our proposed system, named PreFix, we aim to determine during runtime whether a switch failure will happen in the near future. The prediction is based on the measurements of the current switch system status and historical switch hardware failure cases that have been carefully labelled by network operators. Our key observation is that failures of the same switch model share some common syslog patterns before failures occur, and we can apply machine learning methods to extract the common patterns for predicting switch failures. Our novel set of features (message template sequence, frequency, seasonality and surge) for machine learning can efficiently deal with the challenges of noises, sample imbalance, and computation overhead. We evaluated PreFix on a data set collected from 9397 switches (3 different switch models) deployed in more than 20 datacenters owned by a top global search engine in a 2-year period. PreFix achieved an average of 61.81% recall and 1.84x10 -5 false positive ratio, outperforming the other failure prediction methods for computers and ISP devices.","PeriodicalId":210440,"journal":{"name":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"72","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3219617.3219643","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 72

Abstract

In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, thus the operators can intervene and "fix" the potential failures before they happen. Specifically, in our proposed system, named PreFix, we aim to determine during runtime whether a switch failure will happen in the near future. The prediction is based on the measurements of the current switch system status and historical switch hardware failure cases that have been carefully labelled by network operators. Our key observation is that failures of the same switch model share some common syslog patterns before failures occur, and we can apply machine learning methods to extract the common patterns for predicting switch failures. Our novel set of features (message template sequence, frequency, seasonality and surge) for machine learning can efficiently deal with the challenges of noises, sample imbalance, and computation overhead. We evaluated PreFix on a data set collected from 9397 switches (3 different switch models) deployed in more than 20 datacenters owned by a top global search engine in a 2-year period. PreFix achieved an average of 61.81% recall and 1.84x10 -5 false positive ratio, outperforming the other failure prediction methods for computers and ISP devices.
前缀:数据中心网络中的交换机故障预测
在现代数据中心网络(dcn)中,网络设备的故障是常态,而不是例外,许多研究工作都集中在故障发生后的处理上。在本文中,我们采取了一种不同的方法,通过预测故障,从而使操作员可以在潜在故障发生之前进行干预和“修复”。具体来说,在我们提出的名为PreFix的系统中,我们的目标是在运行时确定在不久的将来是否会发生交换机故障。该预测是基于对当前交换机系统状态和历史交换机硬件故障案例的测量,这些故障案例已被网络运营商仔细标记。我们的主要观察结果是,同一交换机模型的故障在发生故障之前共享一些常见的syslog模式,我们可以应用机器学习方法来提取用于预测交换机故障的常见模式。我们新颖的机器学习特征集(消息模板序列、频率、季节性和激增)可以有效地处理噪声、样本不平衡和计算开销的挑战。我们在一个数据集上对PreFix进行了评估,该数据集收集了9397台交换机(3种不同的交换机型号),这些交换机部署在一家全球顶级搜索引擎拥有的20多个数据中心中,历时2年。PreFix的平均召回率为61.81%,假阳性率为1.84 × 10 -5,优于其他针对计算机和ISP设备的故障预测方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信