Behnaz Arzani, S. Ciraci, B. T. Loo, A. Schuster, G. Outhred
{"title":"Taking the Blame Game out of Data Centers Operations with NetPoirot","authors":"Behnaz Arzani, S. Ciraci, B. T. Loo, A. Schuster, G. Outhred","doi":"10.1145/2934872.2934884","DOIUrl":null,"url":null,"abstract":"Today, root cause analysis of failures in data centers is mostly done through manual inspection. More often than not, cus- tomers blame the network as the culprit. However, other components of the system might have caused these failures. To troubleshoot, huge volumes of data are collected over the entire data center. Correlating such large volumes of diverse data collected from different vantage points is a daunting task even for the most skilled technicians. In this paper, we revisit the question: how much can you infer about a failure in the data center using TCP statistics collected at one of the endpoints? Using an agent that cap- tures TCP statistics we devised a classification algorithm that identifies the root cause of failure using this information at a single endpoint. Using insights derived from this classi- fication algorithm we identify dominant TCP metrics that indicate where/why problems occur in the network. We val- idate and test these methods using data that we collect over a period of six months in a production data center.","PeriodicalId":284960,"journal":{"name":"Proceedings of the 2016 ACM SIGCOMM Conference","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"97","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2016 ACM SIGCOMM Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2934872.2934884","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 97
Abstract
Today, root cause analysis of failures in data centers is mostly done through manual inspection. More often than not, cus- tomers blame the network as the culprit. However, other components of the system might have caused these failures. To troubleshoot, huge volumes of data are collected over the entire data center. Correlating such large volumes of diverse data collected from different vantage points is a daunting task even for the most skilled technicians. In this paper, we revisit the question: how much can you infer about a failure in the data center using TCP statistics collected at one of the endpoints? Using an agent that cap- tures TCP statistics we devised a classification algorithm that identifies the root cause of failure using this information at a single endpoint. Using insights derived from this classi- fication algorithm we identify dominant TCP metrics that indicate where/why problems occur in the network. We val- idate and test these methods using data that we collect over a period of six months in a production data center.