Subgraphs as First-Class Citizens in Incident Management for Large-Scale Online Systems: An Evolution-Aware Framework

IF 5.6; Zone 1 (Computer Science); Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

IEEE Transactions on Software Engineering Pub Date : 2025-07-17 DOI:10.1109/TSE.2025.3590221

Zilong He; Pengfei Chen; Yu Luo; Qiuyu Yan; Hongyang Chen; Guangba Yu; Fangyuan Li; Xiaoyun Li; Zibin Zheng

IEEE Transactions on Software Engineering, vol. 51, no. 9, pp. 2494-2511. https://ieeexplore.ieee.org/document/11082738/

Citations: 0

Abstract


Subgraphs as First-Class Citizens in Incident Management for Large-Scale Online Systems: An Evolution-Aware Framework

With the ever-increasing scale and complexity of modern online systems, incidents are becoming inevitable, seriously reducing system availability and user satisfaction. To enhance incident management, many machine-learning-based techniques have been proposed to automate incident detection and diagnosis. However, previous studies have mostly ignored the impact of evolution on the practicality of an incident management framework. Specifically: (1) the scale of modern online systems is continually evolving, but most state-of-the-art techniques depend heavily on continuous modelling of the entire system and are thus less practical for online systems that have grown to tens of thousands of services; (2) the volume of telemetry data is growing massively, while incident records for learning are scarce and slowly generated (sometimes starting from zero), yet prior techniques usually neglect this extreme imbalance in data volume evolution and cannot support the life-cycle evolution (i.e., cold start and continual learning) of their models; (3) prior techniques usually require operators to manually select a set of telemetry as input for incident diagnosis, but ignore how to automatically evolve this selection to continually improve diagnosis performance. These gaps stem from a lack of awareness of evolution, including the evolution of the target online system and the evolution of the incident management models built for it. To fill these gaps, we propose Gem, an evolution-aware incident management framework. Specifically, considering the evolution of system scale and data volume, Gem continually refines the enormous volume of telemetry data collected in real time into individual compact yet expressive graph-based representations, namely issue impact subgraphs, and treats them as first-class citizens in incident management. Centered around these subgraphs, we design a couple of lifelong-learning-based graph analysis techniques to learn and evolve models for incident detection and diagnosis. We evaluate Gem using real-world data collected from the WeChat online system, the largest instant messaging software in China. The results confirm the effectiveness of Gem. Moreover, Gem has been successfully deployed in WeChat, easing the burden on operators in handling a flood of issues and related telemetry data.
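The abstract does not say how Gem builds an issue impact subgraph, so the following is only an illustrative sketch of the general idea of carving a compact subgraph out of a large service graph. It assumes a hypothetical input of caller/callee pairs and a set of services already flagged as anomalous by some upstream detector, and it groups flagged services that are directly connected by calls. The function name `impact_subgraphs` and the grouping rule are assumptions for illustration, not the paper's algorithm.

```python
from collections import deque


def impact_subgraphs(call_edges, anomalous):
    """Group anomalous services into connected subgraphs of the call graph.

    call_edges: iterable of (caller, callee) pairs (hypothetical input format).
    anomalous:  set of service names flagged by some upstream detector.
    Returns a list of (nodes, edges) pairs, one per connected group.
    """
    # Keep only edges whose two endpoints are both anomalous, and treat
    # them as undirected when deciding which services belong together.
    neighbors = {service: set() for service in anomalous}
    kept_edges = []
    for caller, callee in call_edges:
        if caller in anomalous and callee in anomalous:
            neighbors[caller].add(callee)
            neighbors[callee].add(caller)
            kept_edges.append((caller, callee))

    seen, result = set(), []
    for start in sorted(anomalous):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        # Both endpoints of a kept edge always fall in the same component,
        # so checking the caller is enough.
        edges = [(a, b) for a, b in kept_edges if a in component]
        result.append((sorted(component), edges))
    return result


if __name__ == "__main__":
    edges = [("gateway", "auth"), ("auth", "db"),
             ("gateway", "feed"), ("pay", "ledger")]
    flagged = {"auth", "db", "feed", "pay", "ledger"}
    for nodes, sub_edges in impact_subgraphs(edges, flagged):
        print(nodes, sub_edges)
```

The point of the sketch is the scaling argument made in the abstract: downstream analysis works on a handful of small subgraphs rather than on a model of the entire system, so its cost follows the size of the affected region and not the total number of services.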


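Source journal

IEEE Transactions on Software Engineering (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 9.70

Self-citation rate: 10.80%

Articles published: 724

Review time: 6 months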

Journal description: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of the Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include: a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models. b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects. c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards. d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues. e) System issues: Hardware-software trade-offs. f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.