SELECTION OF METRIC AND CATEGORICAL ATTRIBUTES OF RARE ANOMALOUS EVENTS IN A COMPUTER SYSTEM USING DATA MINING METHODS

T-Comm. Pub Date: 1900-01-01. DOI: 10.36724/2072-8735-2021-15-6-40-47

O. Sheluhin, D. Rakovsky



Abstract

The paper considers the process of labeling multi-attribute experimental data for subsequent use by data mining methods in detecting and classifying rare anomalous events in computer systems (CS). Labeling is carried out using three methods: manual preprocessing, statistical analysis, and cluster analysis. Among the metric-type attributes, the authors identify two macrogroups: “integral attributes” and “impulse attributes”. Combining statistical and cluster analysis is shown to increase the accuracy of detecting anomalous events in the CS and to allow attributes to be selected according to their informational significance. The usefulness of manual preprocessing before clustering is illustrated by dividing attributes into macrogroups, analyzing the density distribution with violin plots, and removing the trend component with the difference-stationary (DS) series method. Violin plots constructed for an attribute of the “integral” macrogroup show the distribution of CS states. Removing the trend component by the DS-series method, followed by normalization and conversion to absolute values, allows anomalous outliers to be labeled more accurately, although this approach is not always acceptable. Interpretation of the clustering results obtained for each normalized attribute shows that the normal values of all attributes are concentrated around zero. The result of labeling the experimental data is attribute-labeled data, in which each attribute at each point in time is assigned one of two states: abnormal or normal.
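The per-attribute labeling chain described in the abstract (trend removal by differencing, normalization, absolute values, clustering) can be illustrated with a minimal sketch. The abstract does not state which normalization or clustering algorithm the authors use, so z-score scaling and a two-cluster 1-D k-means are assumptions made here, and all function names and the sample data are hypothetical.

```python
from statistics import mean, pstdev


def difference(series):
    """First-order differencing: removes the trend of a difference-stationary series."""
    return [b - a for a, b in zip(series, series[1:])]


def normalize_abs(values):
    """Z-score normalization (an assumption) followed by absolute values."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [abs((v - mu) / sigma) for v in values]


def two_means_1d(values, iterations=100):
    """Plain 1-D k-means with k=2 (an assumption); returns (low_center, high_center)."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        new_low = mean(low_group)
        new_high = mean(high_group) if high_group else high
        if (new_low, new_high) == (low, high):
            break
        low, high = new_low, new_high
    return low, high


def label_attribute(series):
    """Label each differenced point; the cluster nearest zero is treated as normal."""
    scaled = normalize_abs(difference(series))
    low, high = two_means_1d(scaled)
    return ["normal" if abs(v - low) <= abs(v - high) else "abnormal" for v in scaled]


# Hypothetical "integral" attribute: steady growth with one sudden jump.
counter = [10, 12, 14, 16, 18, 60, 62, 64, 66, 68]
labels = label_attribute(counter)
print(labels)
```

In this sketch the jump between the fifth and sixth samples ends up in the cluster far from zero, while the steady increments fall in the cluster nearest zero. Differencing shortens the series by one point, so the labels refer to transitions between consecutive samples.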
