Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems

Canhua Wu, Nengwen Zhao, Lixin Wang, Xiaoqin Yang, Shining Li, Ming Zhang, Xing Jin, Xidao Wen, Xiaohui Nie, Wenchi Zhang, Kaixin Sui, Dan Pei
Published in: 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE)
Publication date: 2021-10-01
DOI: 10.1109/ISSRE52982.2021.00022 (https://doi.org/10.1109/ISSRE52982.2021.00022)
Citations: 9

Abstract

Incidents in online service systems can cause poor user experience and tremendous economic loss. To reduce the impact of incidents and guarantee service reliability, it is critical to identify root-cause metrics that give engineers clues to assist incident diagnosis. However, this is a challenging task because of the complicated dependencies among metrics and the huge volume of metrics in large-scale systems. Existing approaches are based on either anomaly detection or correlation analysis, and they perform poorly in terms of accuracy or efficiency. To better understand the problem of root-cause metric identification, we conduct a preliminary study based on real-world data analysis and interactions with engineers. The key observation is that root-cause metrics should satisfy two requirements: the metric is expected to behave abnormally during the incident, and its anomaly pattern should be physically meaningful and match what engineers look for. Motivated by these findings, we propose an approach named PatternMatcher to identify root-cause metrics accurately. Specifically, PatternMatcher consists of three steps: coarse-grained anomaly detection, which filters out normal metrics; anomaly pattern classification, which filters out unimportant anomaly patterns; and root-cause metric ranking. An extensive study on four real-world datasets, including 113 incident cases from a large commercial bank, demonstrates that PatternMatcher outperforms all baseline approaches, achieving an average top-3 accuracy of 0.91. Moreover, we have deployed PatternMatcher in practice and share some successful cases from the real deployment.
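
The three-step structure described in the abstract (filter normal metrics, filter unimportant anomaly patterns, rank the rest) can be illustrated with a minimal sketch. The abstract does not say which detector, classifier, or ranking criterion PatternMatcher uses, so everything concrete here is an assumption made for illustration: the z-score detector, the two toy pattern labels (`level_up`, `level_down`), the `important` mapping, the threshold of 3.0, and all function names are hypothetical and are not the authors' method.

```python
from statistics import mean, pstdev


def anomaly_score(series, incident_start):
    """Largest |z-score| of the incident window against the pre-incident baseline.

    Assumption: a plain z-score stands in for the coarse-grained detector.
    """
    baseline, window = series[:incident_start], series[incident_start:]
    mu, sigma = mean(baseline), pstdev(baseline)
    sigma = sigma if sigma > 0 else 1e-9  # guard against a flat baseline
    return max(abs(x - mu) / sigma for x in window)


def classify_pattern(series, incident_start):
    """Toy pattern label (hypothetical): direction of the mean shift."""
    before = mean(series[:incident_start])
    after = mean(series[incident_start:])
    return "level_up" if after > before else "level_down"


def rank_root_cause_metrics(metrics, incident_start, important, threshold=3.0):
    """metrics: name -> list of floats.
    important: name -> set of pattern labels that matter for that metric.
    """
    candidates = []
    for name, series in metrics.items():
        score = anomaly_score(series, incident_start)
        if score < threshold:  # step 1: drop metrics that look normal
            continue
        pattern = classify_pattern(series, incident_start)
        if pattern not in important.get(name, set()):  # step 2: drop unimportant patterns
            continue
        candidates.append((name, pattern, score))
    # step 3: rank the remaining metrics by anomaly severity
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Made-up example data: the incident starts at index 6.
metrics = {
    "cpu_util": [10, 11, 10, 9, 10, 11, 40, 42, 41],
    "free_mem": [50, 51, 49, 50, 50, 51, 80, 82, 81],
    "qps": [100, 101, 99, 100, 100, 101, 100, 99, 101],
}
# Hypothetical domain knowledge: a rise in free memory is not a concern.
important = {
    "cpu_util": {"level_up"},
    "free_mem": {"level_down"},
    "qps": {"level_up", "level_down"},
}
ranked = rank_root_cause_metrics(metrics, incident_start=6, important=important)
for name, pattern, score in ranked:
    print(name, pattern, round(score, 1))
```

In this made-up data, `qps` is dropped in step 1 because it stays near its baseline, and `free_mem` is dropped in step 2 because its anomaly (an increase) is not a pattern of interest for that metric, which mirrors the second requirement stated in the abstract. Only `cpu_util` reaches the ranking step.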