Ranking anomalies in data centers

2012 IEEE Network Operations and Management Symposium. Pub Date: 2012-04-16. DOI: 10.1109/NOMS.2012.6211885

K. Viswanathan, C. Lakshminarayan, V. Talwar, Chengwei Wang, Greg Macdonald, W. Satterfield


Citations: 27

Abstract


Ranking anomalies in data centers

Data centers are growing in size and complexity, driven by trends such as cloud computing and online services. Such large data centers pose several challenges for system management. Key among them is anomaly detection, which requires monitoring and analyzing metrics across several thousand servers and multiple layers of abstraction to detect anomalous system behavior. In practice, multiple anomaly detection tools continuously raise alarms across multiple metrics and servers. These alarms include both true positives and false alarms. Administrators and management tools act on these alarms for diagnosis and deeper root cause analysis, and take appropriate management actions to mitigate the anomalous behavior. Given the scale and scope of the system, administrators and management tools are overwhelmed by the large number of alarms at any given instant, many of which are false alarms. It is therefore necessary to prioritize and rank these alarms, so that timely actions can be taken to maintain the service level agreements for the data center. Existing techniques for such ranking are ad hoc and not scalable. We propose ranking windows of monitored metrics based on their probability of occurrence. We explain how these probabilities can be computed based either on the false positive rates for which the accompanying anomaly detectors were designed or, when available, on the probability models underlying the false positive rates. In the simplest case, the ranking procedure reduces to computing the Z-scores of the observed measurements and computing a statistic from a window of Z-scores to use as a basis for ranking. The proposed techniques are reliable, lightweight, and easy to deploy in the modern data center. We have validated these techniques on synthetic data containing injected anomalies and on data acquired from production data centers.
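
The simplest case mentioned in the abstract (Z-scores of the measurements, then a statistic over a window of Z-scores) can be illustrated with a minimal Python sketch. The abstract does not say which window statistic the authors use, so the choices here are assumptions made for illustration only: each metric has a known baseline mean and standard deviation, measurements are treated as independent and Gaussian, and the window statistic is the log-probability density of the window under that model. For windows of equal length this is equivalent to ranking by the sum of squared Z-scores. The function names and sample metric names are hypothetical.

```python
import math


def z_scores(values, baseline_mean, baseline_std):
    """Standardize raw measurements against an assumed per-metric baseline."""
    return [(v - baseline_mean) / baseline_std for v in values]


def window_log_probability(zs):
    """Log-density of a window of Z-scores under an i.i.d. standard normal model.

    Assumption for illustration: independence and normality. The log-density is
    -0.5 * sum(z^2) minus a constant that depends only on the window length.
    """
    n = len(zs)
    return -0.5 * sum(z * z for z in zs) - 0.5 * n * math.log(2 * math.pi)


def rank_windows(windows, baselines):
    """Return window names ordered from least probable to most probable.

    windows:   dict mapping a name to a list of measurements (equal lengths)
    baselines: dict mapping the same names to (mean, std) tuples
    """
    scored = []
    for name, values in windows.items():
        mu, sigma = baselines[name]
        scored.append((window_log_probability(z_scores(values, mu, sigma)), name))
    scored.sort()  # lowest log-probability (most anomalous) first
    return [name for _, name in scored]


# Hypothetical windows of four samples each, with assumed baselines.
windows = {
    "cpu@server-a": [52, 49, 51, 50],
    "mem@server-b": [70, 90, 95, 99],
    "disk@server-c": [30, 33, 29, 36],
}
baselines = {
    "cpu@server-a": (50, 5),
    "mem@server-b": (60, 10),
    "disk@server-c": (30, 2),
}
ranking = rank_windows(windows, baselines)
```

Ranking by probability rather than by raw deviation puts metrics with different units and scales on a common footing, which is what lets alarms from different metrics and servers share one priority list. Comparing log-densities directly is only meaningful when all windows have the same length.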
