Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks

IF 3.4; Zone 3 (Computer Science); JCR Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

IEEE Access Pub Date : 2025-04-29 DOI:10.1109/ACCESS.2025.3565220

Diego Frazatto Pedroso;Luís Almeida;Lucas Eduardo Gulka Pulcinelli;William Akihiro Alves Aisawa;Inês Dutra;Sarita Mazzini Bruschi

IEEE Access, vol. 13, pp. 77550-77564. Full text: https://ieeexplore.ieee.org/document/10979844/

Citations: 0

Abstract

Cloud computing technologies offer significant advantages in scalability and performance, enabling rapid deployment of applications. The adoption of microservices-oriented architectures has introduced an ecosystem characterized by an increased number of applications, frameworks, abstraction layers, orchestrators, and hypervisors, all operating within distributed systems. This complexity generates vast quantities of logs from diverse sources, making the analysis of these events inherently challenging, particularly in the absence of automation. To address this issue, Machine Learning techniques leveraging Large Language Models (LLMs) offer a promising approach for dynamically identifying patterns within these events. In this study, we propose a novel anomaly detection framework utilizing a microservices architecture deployed on Kubernetes and Istio, enhanced by an LLM. The model was trained on various error scenarios, with Chaos Mesh employed as an error injection tool to simulate faults of different natures, and Locust used as a load generator to create workload stress conditions. After an anomaly is detected by the LLM, we employ a dynamic Bayesian network to provide probabilistic inferences about the incident, proving the relationships between components and assessing the degree of impact among them. Additionally, a ChatBot powered by the same LLM allows users to interact with the AI, ask questions about the detected incident, and gain deeper insights. The experimental results demonstrated the model's effectiveness: it reliably identified all error events across various test scenarios. While it did not miss any anomalies, it did produce some false positives, which remain within acceptable limits.
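To make the probabilistic step concrete, the following is a minimal sketch of how a Bayesian network can rank likely root causes once an anomaly has been flagged. It is not the authors' model: it uses a static three-node network rather than a dynamic one, and the service names (`db`, `cart`, `front`) and every probability are hypothetical values chosen only for illustration.

```python
from itertools import product

# Hypothetical call chain: frontend calls cart, cart calls db.
# Faults are assumed to propagate in the direction db -> cart -> frontend.
# All numbers below are illustrative assumptions, not values from the paper.
PRIOR_DB_FAULT = 0.05
P_CART_GIVEN_DB = {True: 0.80, False: 0.10}      # P(cart faulty | db state)
P_FRONT_GIVEN_CART = {True: 0.70, False: 0.05}   # P(frontend faulty | cart state)


def joint(db: bool, cart: bool, front: bool) -> float:
    """Joint probability of one full assignment, via the chain rule."""
    p_db = PRIOR_DB_FAULT if db else 1 - PRIOR_DB_FAULT
    p_cart = P_CART_GIVEN_DB[db] if cart else 1 - P_CART_GIVEN_DB[db]
    p_front = P_FRONT_GIVEN_CART[cart] if front else 1 - P_FRONT_GIVEN_CART[cart]
    return p_db * p_cart * p_front


def posterior(query: str, evidence: dict[str, bool]) -> float:
    """P(query is faulty | evidence), by enumerating every network state."""
    names = ("db", "cart", "front")
    numerator = 0.0
    denominator = 0.0
    for values in product((True, False), repeat=len(names)):
        state = dict(zip(names, values))
        if any(state[k] != v for k, v in evidence.items()):
            continue  # skip states that contradict the observation
        p = joint(state["db"], state["cart"], state["front"])
        denominator += p
        if state[query]:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    # An anomaly was flagged on the frontend: how likely is each upstream cause?
    for node in ("cart", "db"):
        print(node, round(posterior(node, {"front": True}), 3))
```

Observing a frontend fault raises the probability of a fault in each upstream service above its prior, which is the kind of impact assessment between components that the abstract describes. A dynamic Bayesian network would additionally link each node to its own state in the previous time slice.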


Source journal

IEEE Access (COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 9.80

Self-citation rate: 7.70%

Articles published: 6673

Review time: 6 weeks

About the journal: IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest. IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on: Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals. Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering. Development of new or improved fabrication or manufacturing techniques. Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.