CARE: Infusing Causal Aware Thinking to Root Cause Analysis in Cloud System
Yong Xu, Xu Zhang, Chuan Luo, Si Qin, Rohitashwa Pandey, Chao Du, Qingwei Lin, Yingnong Dang, Andrew Zhou
Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems, 2021-04-26
DOI: 10.1145/3447851.3458737 (https://doi.org/10.1145/3447851.3458737)
Citations: 2
Abstract
With millions of customers around the world accessing online services, ensuring high service availability is critical for cloud systems. In recent years, advances in data mining and machine learning have prompted extensive study of data-driven solutions for detecting anomalous system behavior and diagnosing its root cause. However, without any surveillance of the data generation process, existing passive data-driven approaches may produce analysis results biased by observed and unobserved confounding factors in a dynamic and heterogeneous system, and the resulting misleading mitigation actions can harm service availability. In this paper, we propose infusing causal thinking into current data-driven solutions for cloud systems. We developed CARE, a causal-aware root cause discovery engine, which uses randomized controlled trials to proactively generate less ambiguous data for further analysis. A case study shows the application of CARE to Microsoft Office365.
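
The contrast the abstract draws between passive data and randomized trials can be illustrated with a toy simulation. This is a minimal sketch under invented assumptions, not CARE's implementation: the scenario (a new build that has no real effect on failures, with server load as the confounder), the function name `failure_rate_gap`, and every probability are hypothetical and do not come from the paper.

```python
import random


def failure_rate_gap(n, randomized, seed=0):
    """Failure rate on the new build minus failure rate on the old build.

    Hypothetical setup: the new build has NO true effect on failures.
    Failures depend only on server load, which acts as the confounder.
    """
    rng = random.Random(seed)
    fails = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for _ in range(n):
        high_load = rng.random() < 0.5
        if randomized:
            # Randomized trial: a coin flip decides who gets the new build.
            new_build = rng.random() < 0.5
        else:
            # Passive data: the new build went mostly to high-load servers.
            new_build = rng.random() < (0.8 if high_load else 0.2)
        # Failure probability depends on load only, never on the build.
        failed = rng.random() < (0.30 if high_load else 0.05)
        counts[new_build] += 1
        fails[new_build] += failed
    return fails[True] / counts[True] - fails[False] / counts[False]


obs_gap = failure_rate_gap(20000, randomized=False)
rct_gap = failure_rate_gap(20000, randomized=True)
print(f"passive (confounded) gap: {obs_gap:.3f}")
print(f"randomized gap:           {rct_gap:.3f}")
```

In the passive data the new build looks harmful only because it was deployed mostly to high-load servers. Random assignment breaks the link between load and build, so the estimated gap falls close to the true effect of zero.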