S. Erokhin, B. Borisenko, I. D. Martishin, A. Fadeev
"Analysis of existing methods to reduce the dimensionality of input data"
T-Comm, DOI: 10.36724/2072-8735-2022-16-1-30-37
Citations: 4
Abstract
The explosive growth of data arrays, both in the number of records and in the number of attributes, has triggered the development of platforms for handling big data (Amazon Web Services, Google, IBM, Infoworks, Oracle, etc.), as well as parallel algorithms for data analysis (classification, clustering, association rules). This, in turn, has prompted the use of dimensionality reduction techniques. Feature selection, as a data preprocessing strategy, has proven effective and efficient in preparing data (especially high-dimensional data) for various data mining and machine learning tasks. Dimensionality reduction not only speeds up algorithm execution but can also improve final classification/clustering accuracy. Noisy or even erroneous input data often degrades algorithm performance, and removing uninformative or weakly informative columns can help the algorithm find more general regions and classification rules and generally achieve better results. This article discusses commonly used data dimensionality reduction methods and their classification. Data transformation consists of two steps: feature generation and feature selection. A distinction is made between scalar feature selection and vector methods (wrapper methods, filter methods, embedded methods, and hybrid methods); each method has its own advantages and disadvantages, which are outlined in the article. The article then describes the application of one of the most effective dimensionality reduction methods, correspondence analysis, to the CSE-CIC-IDS2018 dataset, and evaluates the effectiveness of this method for reducing the dimensionality of that dataset in the detection of computer attacks.
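As a rough illustration of the correspondence-analysis step the abstract refers to, the sketch below implements the standard SVD-based procedure in plain NumPy. The contingency table, variable names, and category labels are invented for illustration only; the paper's actual features and the CSE-CIC-IDS2018 data are not reproduced here.

```python
import numpy as np

def correspondence_analysis(N, n_components=2):
    """Minimal correspondence analysis of a contingency table N.

    Returns principal row coordinates, principal column coordinates,
    and the singular values (whose squares are the principal inertias).
    """
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    # Standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates: scale singular vectors by singular values
    # and undo the mass weighting.
    rows = (U * sv) / np.sqrt(r)[:, None]
    cols = (Vt.T * sv) / np.sqrt(c)[:, None]
    return rows[:, :n_components], cols[:, :n_components], sv

# Synthetic example: counts of (hypothetical) traffic classes vs. feature bins
N = np.array([[30., 10.,  5.],
              [10., 40., 15.],
              [ 5., 15., 35.]])
row_coords, col_coords, singular_values = correspondence_analysis(N)
```

Because the row-mass/column-mass product is subtracted before the SVD, the trivial dimension is removed by construction, and the leading coordinates give a low-dimensional embedding of rows and columns that preserves the chi-square distances between profiles.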