Xiwei Xu, Liming Zhu, Daniel W. Sun, An Binh Tran, I. Weber, Min Fu, L. Bass
2015 11th European Dependable Computing Conference (EDCC), published 2015-09-07. DOI: 10.1109/EDCC.2015.15 (https://doi.org/10.1109/EDCC.2015.15)
Error Diagnosis of Cloud Application Operation Using Bayesian Networks and Online Optimisation
Operations such as upgrade or redeployment are an important cause of system outages, and diagnosing the resulting errors at runtime poses significant challenges. In this paper, we propose an error diagnosis approach using Bayesian Networks. Each node in the network captures a potential (root) cause of operational errors and its probability under different operational contexts. Once an operational error is detected, our diagnosis algorithm chooses a starting node, traverses the Bayesian Network, and performs the assertion checking associated with each node to confirm the error, retrieve further information, and update the belief network. The next node to check is selected through an online optimisation that minimises the overall availability risk, considering diagnosis time and fault consequence. Our experiments show that, in most cases, the technique significantly reduces the risk of faults compared to other approaches. The diagnosis accuracy is high but also depends on the transient nature of a fault.
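The check-select-update loop described in the abstract can be illustrated with a small Python sketch. It is not the authors' algorithm: the abstract does not give the network structure or the optimisation objective, so this version assumes a flat set of mutually exclusive candidate causes, fully reliable assertion checks, and a greedy score of belief times consequence divided by check time. The names `Cause`, `pick_next`, `rule_out`, and `diagnose`, the example causes, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Cause:
    name: str
    check_seconds: float       # assumed time needed to run this cause's assertion
    consequence: float         # assumed cost if this fault stays undiagnosed
    check: Callable[[], bool]  # assertion check: True means the fault is confirmed


def pick_next(beliefs: Dict[str, float], causes: Dict[str, Cause]) -> str:
    """Greedy stand-in for the online optimisation (an assumption, not the
    paper's objective): prefer the cause with the highest expected
    consequence per second of checking."""
    return max(
        beliefs,
        key=lambda n: beliefs[n] * causes[n].consequence / causes[n].check_seconds,
    )


def rule_out(beliefs: Dict[str, float], name: str) -> Dict[str, float]:
    """Belief update after a negative check, assuming exactly one cause is
    present and checks never give false negatives: drop the cause and
    renormalise the remaining probabilities."""
    remaining = {n: p for n, p in beliefs.items() if n != name}
    total = sum(remaining.values())
    if total == 0:
        return {}
    return {n: p / total for n, p in remaining.items()}


def diagnose(
    cause_list: List[Cause], priors: Dict[str, float]
) -> Tuple[Optional[str], List[str]]:
    """Run checks until one confirms a fault; return it and the check order."""
    causes = {c.name: c for c in cause_list}
    beliefs = dict(priors)
    order: List[str] = []
    while beliefs:
        name = pick_next(beliefs, causes)
        order.append(name)
        if causes[name].check():
            return name, order
        beliefs = rule_out(beliefs, name)
    return None, order


# Hypothetical causes for a failed redeployment; every value is made up.
example_causes = [
    Cause("wrong_image_version", 5.0, 10.0, lambda: False),
    Cause("firewall_rule_missing", 2.0, 4.0, lambda: True),
    Cause("quota_exhausted", 1.0, 1.0, lambda: False),
]
example_priors = {
    "wrong_image_version": 0.5,
    "firewall_rule_missing": 0.3,
    "quota_exhausted": 0.2,
}
found, order = diagnose(example_causes, example_priors)
print(found, order)
```

With these made-up numbers the sketch checks `wrong_image_version` first, rules it out, renormalises the remaining beliefs to 0.6 and 0.4, and then confirms `firewall_rule_missing`. A full Bayesian Network would instead propagate evidence through conditional probability tables, and the paper's optimisation may weigh diagnosis time and fault consequence differently from this greedy ratio.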