Zhaowen Wang;Yong Zhou;YangSiyu Zhang;Fengyu Cong;Dongdong Zhou;Zhijian An
DOI: 10.1109/JIOT.2025.3567091
Journal: IEEE Internet of Things Journal, vol. 12, no. 14, pp. 28889-28898 (impact factor 8.9; JCR Q1, Computer Science, Information Systems)
Publication date: 2025-03-06
Publication type: Journal Article
URL: https://ieeexplore.ieee.org/document/10988663/
PA-Rank: A GAN and Reinforcement Learning Powered Framework for Multimetric Anomaly Detection and Causal Diagnosis
The increasing scale and complexity of modern IT systems necessitate advanced solutions for monitoring and managing performance anomalies. Artificial intelligence for IT operations (AIOps) has emerged as a promising approach to enhance the efficiency and effectiveness of IT operations. However, existing methods struggle with effectively detecting anomalies in multidimensional performance data and accurately identifying their root causes in complex interdependent systems. This article proposes a novel framework, PA-Rank, that combines generative adversarial networks (GANs), reinforcement learning, and graph-based methods to address these challenges comprehensively. For anomaly detection, an unsupervised GAN-based model is developed to identify anomalous time periods and assign weighted scores to metrics, facilitating precise anomaly identification. For root cause localization, a causal graph construction model (CGCM) has been developed, utilizing a reinforcement learning-based causal discovery method that is integrated with graph attention networks (GAT) to construct a causal graph representing the relationships between metrics. A random walk algorithm further ranks metric importance during anomalies, enabling effective root cause localization. Extensive experiments on real-world datasets, including server machine dataset (SMD), ASD, and DAMADICS, demonstrate the superiority of PA-Rank over traditional statistical and state-of-the-art machine learning methods. On the SMD dataset, the proposed framework achieved an F1 score of 0.9542 for anomaly detection and consistently identified root causes among top-ranked candidates on the Pymicro and RMS datasets with the highest PR@Avg scores. These results underscore PA-Rank’s efficacy in diagnosing performance anomalies and supporting efficient system maintenance.
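The abstract says that a random walk over the causal graph ranks metrics during an anomaly, but it does not give the transition rule. The following is a minimal sketch of one common choice, a random walk with restart, and should not be read as the authors' implementation. It assumes that the walker moves against the causal direction (from effect to cause), that it prefers causes with high anomaly scores, and that it restarts at a metric in proportion to that metric's anomaly score. The function name `rank_metrics`, the edge format, the metric names, and all numeric values are hypothetical.

```python
def rank_metrics(edges, anomaly_scores, restart=0.15, iterations=200):
    """Rank metrics with a random walk with restart on a causal graph.

    edges: dict mapping (cause, effect) -> nonnegative edge weight
           (hypothetical input format; both endpoints must appear in
           anomaly_scores).
    anomaly_scores: dict mapping metric name -> nonnegative anomaly score.
    Returns (ranking, visit_probability).
    """
    nodes = sorted(anomaly_scores)
    total = sum(anomaly_scores.values())
    if total <= 0:
        raise ValueError("at least one metric needs a positive anomaly score")

    # Restart distribution: teleport to a metric in proportion to its score.
    teleport = {n: anomaly_scores[n] / total for n in nodes}

    # Assumed transition rule: step from an effect to one of its causes,
    # weighting each candidate cause by edge weight times its anomaly score.
    moves = {n: {} for n in nodes}
    for (cause, effect), weight in edges.items():
        moves[effect][cause] = weight * anomaly_scores[cause]

    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: restart * teleport[n] for n in nodes}
        for node in nodes:
            out_total = sum(moves[node].values())
            if out_total > 0:
                for cause, w in moves[node].items():
                    nxt[cause] += (1 - restart) * rank[node] * w / out_total
            else:
                # No usable upstream cause: the walker stays where it is,
                # so probability mass accumulates at candidate root causes.
                nxt[node] += (1 - restart) * rank[node]
        rank = nxt

    ranking = sorted(nodes, key=lambda n: rank[n], reverse=True)
    return ranking, rank


if __name__ == "__main__":
    # Toy causal chain: disk_io -> cpu -> latency, plus a weak memory edge.
    toy_edges = {
        ("disk_io", "cpu"): 1.0,
        ("cpu", "latency"): 1.0,
        ("memory", "latency"): 0.2,
    }
    toy_scores = {"latency": 0.9, "cpu": 0.7, "disk_io": 0.8, "memory": 0.1}
    ranking, probs = rank_metrics(toy_edges, toy_scores)
    for name in ranking:
        print(f"{name}: {probs[name]:.4f}")
```

In this toy graph the walker is drawn upstream toward `disk_io`, which has no parent and a high anomaly score, so it accumulates most of the visit probability. In a real pipeline the edges would come from the learned causal graph and the scores from the anomaly detector.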
About the journal:
The IEEE Internet of Things (IoT) Journal publishes articles and review articles covering various aspects of IoT, including IoT system architecture, IoT enabling technologies, IoT communication and networking protocols such as network coding, and IoT services and applications. Topics encompass IoT's impact on sensor technologies, big data management, and future Internet design for applications such as smart cities and smart homes. Fields of interest include IoT architectures such as things-centric, data-centric, and service-oriented architectures; IoT enabling technologies and systematic integration such as sensor technologies, big sensor data management, and future Internet design for IoT; IoT services, applications, and test-beds such as IoT service middleware, IoT application programming interfaces (APIs), IoT application design, and IoT trials and experiments; and IoT standardization activities and technology development in standards development organizations (SDOs) such as IEEE, IETF, ITU, 3GPP, and ETSI.