Anomaly Detection on IBM Z Mainframes: Performance Analysis and More

Proceedings of the 16th ACM International Conference on Systems and Storage Pub Date : 2023-06-05 DOI:10.1145/3579370.3594770

Erik Altman, Benjamin Segal



Abstract

Anomalous events can signal a variety of problems in any system. Robust, fast detection of anomalies is therefore important, so that issues can be fixed before they cascade into larger problems. In this paper we focus on IBM Z mainframes, although most of the problems addressed and techniques used are broadly applicable. Anomalies can signal issues such as disk malfunctions, slow or unresponsive modules, crashes and latent bugs, lock contention, excessive retries, and the need to allocate more resources to reduce contention. Although specific techniques exist for addressing individual issues, anomaly detection is valuable for its broad-spectrum utility and for its ability to identify combinations of problems for which no specific approach may have been implemented. In addition, anomaly detection serves as a backstop: truly anomalous events suggest that normal mechanisms did not work. Our input for detecting anomalies is low-level, summarized information available in time series to the zOS operating system. Although such information lacks some high-level context, it provides operating-system-level awareness that applies universally to any zOS system and to any code running on such a system. The data is also quite rich, with 100 to 100,000 metrics per sample depending on how "metric" is defined. As might be expected, the data contains metrics such as CPU utilization, execution priorities, internal locking behavior, and bytes read and written by an executing process. It also contains higher-level information such as executing process names and transactional identifiers from online transactional processing facilities. Names are useful not only in detecting anomalies, but also in conveying context to users trying to isolate and fix problems. Our techniques build on KL divergence [21] and learn continuously without supervision and with low overhead. Continuous learning is important: the first instance of an aberrant behavior is an anomaly, but the 10th instance probably is not. This point also illustrates the utility of anomaly detection in pinpointing root cause: early detection is essential, and broad-spectrum anomaly detection provides excellent capability to do just that. This paper outlines these techniques and demonstrates their efficacy in detecting and resolving key problems.
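The abstract does not describe the authors' algorithm in detail, so the following is only a minimal illustrative sketch of the general idea it names: scoring each new sample by its KL divergence from a continuously updated baseline, so that a repeated aberration gradually stops being flagged. The class `StreamingKLDetector`, the `threshold` and `decay` parameters, and the exponential blending rule are all assumptions made for illustration, not details taken from the paper.

```python
import math


def normalize(counts):
    """Turn raw per-category counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) in nats for discrete distributions stored as dicts.

    Categories missing from q are floored at eps to avoid division by zero.
    """
    score = 0.0
    for key, p_k in p.items():
        if p_k > 0.0:
            score += p_k * math.log(p_k / max(q.get(key, 0.0), eps))
    return score


class StreamingKLDetector:
    """Hypothetical unsupervised detector (not the paper's implementation).

    Each sample is scored against a baseline distribution, then blended
    into that baseline so the detector keeps learning from what it sees.
    """

    def __init__(self, threshold=0.5, decay=0.8):
        self.threshold = threshold  # assumed cutoff for flagging a sample
        self.decay = decay          # assumed weight kept by the old baseline
        self.baseline = None

    def observe(self, counts):
        """Return (score, is_anomaly) for one sample of category counts."""
        current = normalize(counts)
        if self.baseline is None:
            # The first sample only initializes the baseline.
            self.baseline = current
            return 0.0, False
        score = kl_divergence(current, self.baseline)
        # Continuous learning: exponentially blend the sample into the baseline.
        keys = set(self.baseline) | set(current)
        self.baseline = {
            k: self.decay * self.baseline.get(k, 0.0)
            + (1.0 - self.decay) * current.get(k, 0.0)
            for k in keys
        }
        return score, score > self.threshold
```

With these assumed settings, feeding the detector several samples such as `{"read": 90, "write": 10}` followed by repeated samples of `{"read": 10, "write": 90}` flags the first aberrant sample, while later repetitions score progressively lower as the baseline absorbs the new behavior. This mirrors the abstract's observation that the first instance of an aberrant behavior is an anomaly and the 10th probably is not.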

