Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach

ArXiv. Pub Date: 2024-03-11. DOI: 10.1145/3639477.3639745

Jinxi Kuang, Jinyang Liu, Junjie Huang, Renyi Zhong, Jiazhen Gu, Lan Yu, Rui Tan, Zengyin Yang, Michael R. Lyu



Abstract

Due to the scale and complexity of cloud systems, a single system failure can trigger an "alert storm", i.e., a massive number of correlated alerts. Although these alerts can be traced back to a few root causes, their sheer number makes manual handling infeasible. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing approaches typically aggregate alerts using either semantic similarity or statistical methods. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. To tackle these limitations, we leverage external knowledge, i.e., the Standard Operation Procedures (SOPs) of alerts, as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module captures the temporal and spatial relations between alerts and measures their correlations efficiently. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, keeping COLA efficient when handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show that COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods while achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.
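The hybrid routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Alert` fields, the toy `correlation_score` formula, the two thresholds `low` and `high`, and the `llm_judge` callback (a stand-in for an SOP-informed LLM query) are all assumptions made for illustration. The sketch only shows the control flow: a cheap correlation score settles confident pairs, and only the uncertain pairs in between are forwarded to the expensive reasoning step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Alert:
    # All fields are hypothetical; the abstract does not specify an alert schema.
    alert_id: str
    kind: str         # alert type, e.g. "db_latency"
    component: str    # where the alert fired (a crude "spatial" signal)
    timestamp: float  # seconds since some reference point


def correlation_score(a: Alert, b: Alert, window: float = 300.0) -> float:
    """Toy score in [0, 1]: half temporal closeness, half same-component match.

    This formula is an assumption, not the paper's correlation mining method.
    """
    gap = abs(a.timestamp - b.timestamp)
    temporal = max(0.0, 1.0 - gap / window)
    spatial = 1.0 if a.component == b.component else 0.0
    return 0.5 * temporal + 0.5 * spatial


def aggregate(
    alerts: List[Alert],
    llm_judge: Callable[[Alert, Alert], bool],
    low: float = 0.3,
    high: float = 0.7,
) -> Tuple[List[List[str]], int]:
    """Group alerts; return (groups of alert ids, number of llm_judge calls)."""
    parent = list(range(len(alerts)))  # union-find forest over alert indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    llm_calls = 0
    for i in range(len(alerts)):
        for j in range(i + 1, len(alerts)):
            score = correlation_score(alerts[i], alerts[j])
            if score >= high:
                related = True       # confident: statistics alone decide
            elif score <= low:
                related = False      # confident: clearly unrelated
            else:
                llm_calls += 1       # uncertain: defer to the reasoning step
                related = llm_judge(alerts[i], alerts[j])
            if related:
                parent[find(i)] = find(j)

    groups: Dict[int, List[str]] = {}
    for i, alert in enumerate(alerts):
        groups.setdefault(find(i), []).append(alert.alert_id)
    return sorted(sorted(g) for g in groups.values()), llm_calls
```

In this sketch, `llm_judge` could be any callable that takes two alerts and returns whether they share a root cause; in a real deployment it would presumably wrap a model prompt that includes the relevant SOP text. The all-pairs loop is quadratic and is kept only for clarity.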

