Monitoring Cloud Service Unreachability at Scale
Kapil Agrawal, Viral Mehta, Sundararajan Renganathan, Sreangsu Acharyya, V. Padmanabhan, Chakri Kotipalli, Liting Zhao
IEEE INFOCOM 2021 - IEEE Conference on Computer Communications
Published: 2021-05-10
DOI: 10.1109/INFOCOM42981.2021.9488778
Citations: 1
Abstract
We consider the problem of network unreachability in a global-scale cloud-hosted service that caters to hundreds of millions of users. Even when the service itself is up, the "last mile" between where users are and the cloud is often the weak link that could render the service unreachable. We present NetDetector, a tool for detecting network unreachability based on measurements from a client-based HTTP-ping service. NetDetector employs two models. The first, GA (Gaussian Alerts), models the temporally averaged raw success rate of the HTTP-pings as a Gaussian distribution and flags significant dips below the mean as unreachability episodes. The second, more sophisticated approach (BB, or Beta-Binomial) models the health of network connectivity as the probability of an access request succeeding, estimates health from noisy samples, and alerts based on dips in health below a client-network-specific SLO (service-level objective) derived from data. These algorithms are enhanced by a drill-down technique that identifies a more precise scope of the unreachability event. We present promising results from GA, which has been in deployment, and the experimental BB detector over a 4-month period. For instance, GA flags 49 country-level unreachability incidents, of which 42 were labelled true positives based on investigation by on-call engineers (OCEs).
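The two detection ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no thresholds, priors, or window sizes, so the function names, the `k` multiplier, the Beta prior parameters, and the use of a posterior mean (rather than whatever estimator the paper uses) are all assumptions made for illustration.

```python
from statistics import mean, stdev


def gaussian_alert(history, current, k=3.0):
    """GA-style check (illustrative): treat past averaged success rates as
    Gaussian and flag `current` if it dips more than k standard deviations
    below their mean. `k` is a hypothetical tuning parameter."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs >= 2 points
    return current < mu - k * sigma


def posterior_health(successes, total, prior_a=1.0, prior_b=1.0):
    """BB-style estimate (illustrative): with a Beta(prior_a, prior_b) prior
    on the success probability and a Binomial likelihood, the posterior is
    Beta(prior_a + successes, prior_b + failures). Return its mean."""
    return (prior_a + successes) / (prior_a + prior_b + total)


def slo_alert(successes, total, slo, prior_a=1.0, prior_b=1.0):
    """Alert when the estimated health falls below the SLO. The SLO value
    is assumed to be supplied per client network."""
    return posterior_health(successes, total, prior_a, prior_b) < slo


if __name__ == "__main__":
    # Hypothetical averaged success rates for one client network.
    history = [0.98, 0.99, 0.97, 0.98, 0.99]
    print(gaussian_alert(history, 0.90))  # large dip
    print(gaussian_alert(history, 0.97))  # ordinary fluctuation

    # A strong prior damps a tiny noisy sample but not a large one.
    print(slo_alert(2, 3, slo=0.95, prior_a=98, prior_b=2))
    print(slo_alert(200, 300, slo=0.95, prior_a=98, prior_b=2))
```

The second pair of calls shows why a Beta prior helps with noisy samples: two successes out of three pings barely moves the estimate, whereas the same success ratio over 300 pings pulls it well below the assumed SLO.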