A fault injection platform for learning AIOps models

F. Bagehorn, Jesus Rios, Saurabh Jha, Robert Filepp, L. Shwartz, Naoki Abe, Xi Yang
{"title":"A fault injection platform for learning AIOps models","authors":"F. Bagehorn, Jesus Rios, Saurabh Jha, Robert Filepp, L. Shwartz, Naoki Abe, Xi Yang","doi":"10.1145/3551349.3559503","DOIUrl":null,"url":null,"abstract":"In today’s IT environment with a growing number of costly outages, increasing complexity of the systems, and availability of massive operational data, there is a strengthening demand to effectively leverage Artificial Intelligence and Machine Learning (AI/ML) towards enhanced resiliency. In this paper, we present an automatic fault injection platform to enable and optimize the generation of data needed for building AI/ML models to support modern IT operations. The merits of our platform include the ease of use, the possibility to orchestrate complex fault scenarios and to optimize the data generation for the modeling task at hand. Specifically, we designed a fault injection service that (i) combines fault injection with data collection in a unified framework, (ii) supports hybrid and multi-cloud environments, and (iii) does not require programming skills for its use. Our current implementation covers the most common fault types both at the application and infrastructure levels. The platform also includes some AI capabilities. In particular, we demonstrate the interventional causal learning capability currently available in our platform. We show how our system is able to learn a model of error propagation in a micro-service application in a cloud environment (when the communication graph among micro-services is unknown and only logs are available) for use in subsequent applications such as fault localization.","PeriodicalId":197939,"journal":{"name":"Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering","volume":"41 8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3551349.3559503","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

In today’s IT environment with a growing number of costly outages, increasing complexity of the systems, and availability of massive operational data, there is a strengthening demand to effectively leverage Artificial Intelligence and Machine Learning (AI/ML) towards enhanced resiliency. In this paper, we present an automatic fault injection platform to enable and optimize the generation of data needed for building AI/ML models to support modern IT operations. The merits of our platform include the ease of use, the possibility to orchestrate complex fault scenarios and to optimize the data generation for the modeling task at hand. Specifically, we designed a fault injection service that (i) combines fault injection with data collection in a unified framework, (ii) supports hybrid and multi-cloud environments, and (iii) does not require programming skills for its use. Our current implementation covers the most common fault types both at the application and infrastructure levels. The platform also includes some AI capabilities. In particular, we demonstrate the interventional causal learning capability currently available in our platform. We show how our system is able to learn a model of error propagation in a micro-service application in a cloud environment (when the communication graph among micro-services is unknown and only logs are available) for use in subsequent applications such as fault localization.
一个学习AIOps模型的故障注入平台
在当今的IT环境中,随着越来越多的代价高昂的中断,系统复杂性的增加以及大量操作数据的可用性,有效利用人工智能和机器学习(AI/ML)来增强弹性的需求不断增强。在本文中,我们提出了一个自动故障注入平台,以启用和优化构建AI/ML模型所需的数据生成,以支持现代IT操作。我们平台的优点包括易于使用,可以编排复杂的故障场景,并为手头的建模任务优化数据生成。具体来说,我们设计了一个故障注入服务,它(i)将故障注入与数据收集结合在一个统一的框架中,(ii)支持混合云和多云环境,(iii)不需要编程技能即可使用。我们当前的实现涵盖了应用程序和基础结构级别上最常见的故障类型。该平台还包括一些人工智能功能。特别是,我们展示了目前在我们的平台上可用的介入因果学习能力。我们展示了我们的系统如何能够在云环境中的微服务应用程序中学习错误传播模型(当微服务之间的通信图未知并且只有日志可用时),以便在后续应用程序(如故障定位)中使用。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信