Monitoring Cloud Service Unreachability at Scale

Kapil Agrawal, Viral Mehta, Sundararajan Renganathan, Sreangsu Acharyya, V. Padmanabhan, Chakri Kotipalli, Liting Zhao
Published in: IEEE INFOCOM 2021 - IEEE Conference on Computer Communications
Publication date: 2021-05-10
DOI: 10.1109/INFOCOM42981.2021.9488778 (https://doi.org/10.1109/INFOCOM42981.2021.9488778)

Abstract

We consider the problem of network unreachability in a global-scale cloud-hosted service that caters to hundreds of millions of users. Even when the service itself is up, the "last mile" between users and the cloud is often the weak link that can render the service unreachable. We present NetDetector, a tool for detecting network unreachability based on measurements from a client-based HTTP-ping service. NetDetector employs two models. The first, GA (Gaussian Alerts), models the temporally averaged raw success rate of the HTTP-pings as a Gaussian distribution and flags significant dips below the mean as unreachability episodes. The second, more sophisticated approach, BB (Beta-Binomial), models the health of network connectivity as the probability of an access request succeeding, estimates health from noisy samples, and alerts on dips in health below a client-network-specific SLO (service-level objective) derived from data. Both algorithms are enhanced by a drill-down technique that identifies a more precise scope of the unreachability event. We present promising results from GA, which has been in deployment, and from the experimental BB detector, over a 4-month period. For instance, GA flagged 49 country-level unreachability incidents, of which 42 were labelled true positives based on investigation by on-call engineers (OCEs).
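
The two detection ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the NetDetector implementation: the abstract gives no thresholds, priors, or window sizes, so the function names, the z-score cutoff, the uniform Beta(1, 1) prior, and the credible-bound rule below are all assumptions chosen for illustration.

```python
import math
from statistics import mean, stdev


def gaussian_alert(history, current, z_threshold=3.0):
    """GA-style check (assumed form): treat past averaged success rates as
    Gaussian and flag `current` if it dips more than `z_threshold`
    standard deviations below their mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No historical variation: any drop below the mean counts as a dip.
        return current < mu
    return (current - mu) / sigma < -z_threshold


def beta_posterior(successes, failures, prior_alpha=1.0, prior_beta=1.0):
    """Posterior mean and standard deviation of the success probability
    ("health") under a Beta prior and Binomial observations."""
    a = prior_alpha + successes
    b = prior_beta + failures
    post_mean = a / (a + b)
    post_var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return post_mean, math.sqrt(post_var)


def health_alert(successes, failures, slo, z=2.0):
    """BB-style check (assumed form): alert only when an approximate upper
    bound on health (posterior mean + z * sd) is still below the SLO, so
    that a handful of noisy samples does not trigger an alert."""
    post_mean, post_sd = beta_posterior(successes, failures)
    return post_mean + z * post_sd < slo


if __name__ == "__main__":
    past_rates = [0.98, 0.99, 0.97, 0.98, 0.99, 0.98]
    print(gaussian_alert(past_rates, 0.90))   # large dip below the mean
    print(gaussian_alert(past_rates, 0.975))  # within normal variation
    print(health_alert(80, 20, slo=0.95))     # many samples, low health
    print(health_alert(3, 1, slo=0.95))       # too few samples to decide
```

The second function shows why estimating health from counts, rather than thresholding the raw success rate, helps with sparse client networks: 3 successes out of 4 pings has a lower raw rate than 80 out of 100, yet the sketch withholds an alert for it because the posterior is still too wide.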