NetCruiser: Localize Network Failures by Learning from Latency Data
Haoshi Ren, Lihai Nie, Hongyun Gao, Laiping Zhao, J. Diao
2020 IEEE International Conference on Smart Internet of Things (SmartIoT)
Published: 2020-08-01
DOI: https://doi.org/10.1109/SmartIoT49966.2020.00013
Citations: 1
Abstract
In modern data center networks (DCNs), network device failures are common and difficult to localize. Our key observation is that latency data reflects and profiles network status, and this information can be used to address problems such as network failure localization. In this paper, we present NetCruiser, a system that localizes failures by learning from latency data. NetCruiser measures and collects latency data to monitor the status of the whole network and pinpoint which switch or router has failed. We design a data structure to organize the latency data and, on top of it, build a machine learning model that infers where a failure has occurred. Our experimental evaluation validates both the efficiency and the effectiveness of our approach. The system applies to both inter-DC and intra-DC networks.
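The abstract does not specify NetCruiser's data structure or model, so the following is only an illustrative sketch of the general idea of localizing a failed switch from latency data. Everything in it is an assumption made for illustration: the toy topology in PATHS, the per-hop and fault latency constants, the simulated measurements, and the nearest-centroid classifier, which stands in for whatever machine learning model the paper actually uses.

```python
import random

# Hypothetical toy topology (not from the paper): each probe path between
# two hosts traverses an ordered list of switches.
PATHS = {
    ("h1", "h2"): ["s1"],
    ("h1", "h3"): ["s1", "s3", "s2"],
    ("h2", "h4"): ["s1", "s3", "s2"],
    ("h3", "h4"): ["s2"],
    ("h1", "h4"): ["s1", "s4", "s2"],
}
SWITCHES = ["s1", "s2", "s3", "s4"]
HOP_MS = 0.1    # assumed healthy per-hop latency
FAULT_MS = 5.0  # assumed extra latency added by a failed switch


def measure(failed, rng):
    """Simulate one latency vector (ms), one entry per probe path."""
    vector = []
    for hops in PATHS.values():
        latency = 0.0
        for switch in hops:
            latency += HOP_MS + abs(rng.gauss(0.0, 0.02))
            if switch == failed:
                latency += FAULT_MS
        vector.append(latency)
    return vector


def train(rng, samples_per_label=50):
    """Average simulated latency vectors into one centroid per label."""
    centroids = {}
    for label in SWITCHES + ["healthy"]:
        rows = [measure(label, rng) for _ in range(samples_per_label)]
        centroids[label] = [sum(col) / len(col) for col in zip(*rows)]
    return centroids


def localize(vector, centroids):
    """Return the label whose centroid is closest (squared Euclidean)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(vector, centroids[label]))
    return min(centroids, key=dist)


rng = random.Random(0)
centroids = train(rng)
print(localize(measure("s3", rng), centroids))
```

The sketch works only because each switch in this toy topology sits on a distinct set of probe paths, so each failure leaves a distinct latency signature; a real deployment would need probe coverage that gives the same property.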