Analysis of existing methods to reduce the dimensionality of input data

S. Erokhin, B. Borisenko, I. D. Martishin, A. Fadeev
{"title":"Analysis of existing methods to reduce the dimensionality of input data","authors":"S. Erokhin, B. Borisenko, I. D. Martishin, A. Fadeev","doi":"10.36724/2072-8735-2022-16-1-30-37","DOIUrl":null,"url":null,"abstract":"The explosive growth of data arrays, both in the number of records and in attributes, has triggered the development of a number of platforms for handling big data (Amazon Web Services, Google, IBM, Infoworks, Oracle, etc.), as well as parallel algorithms for data analysis (classification, clustering, associative rules). This, in turn, has prompted the use of dimensionality reduction techniques. Feature selection, as a data preprocessing strategy, has proven to be effective and efficient in preparing data (especially high-dimensional data) for various data collection and machine learning tasks. Dimensionality reduction is not only useful for speeding up algorithm execution, but can also help in the final classification/clustering accuracy. Too noisy or even erroneous input data often results in less than desirable algorithm performance. Removing uninformative or low-informative columns of data can actually help the algorithm find more general areas and classification rules and generally achieve better performance. This article discusses commonly used data dimensionality reduction methods and their classification. Data transformation consists of two steps: feature generation and feature selection. A dis tinction is made between scalar feature selection and vector methods (wrapper methods, filtering methods, embedded methods and hybrid methods). Each method has its own advantages and disadvantages, which are outlined in the article. It describes the application of one of the most effective methods of dimensionality reduction - the method of correspondence analysis for CSE-CIC-IDS2018 dataset. 
The effectiveness of this method in the tasks of dimensionality reduction of the specified dataset in the detection of computer attacks is checked.","PeriodicalId":263691,"journal":{"name":"T-Comm","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"T-Comm","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.36724/2072-8735-2022-16-1-30-37","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

The explosive growth of data arrays, both in the number of records and in the number of attributes, has triggered the development of a number of platforms for handling big data (Amazon Web Services, Google, IBM, Infoworks, Oracle, etc.), as well as of parallel algorithms for data analysis (classification, clustering, association rules). This, in turn, has prompted the use of dimensionality reduction techniques. Feature selection, as a data preprocessing strategy, has proven effective and efficient in preparing data (especially high-dimensional data) for various data collection and machine learning tasks. Dimensionality reduction is useful not only for speeding up algorithm execution but also for improving the final classification/clustering accuracy. Noisy or even erroneous input data often leads to worse-than-desired algorithm performance, and removing uninformative or weakly informative columns can help an algorithm find more general regions and classification rules and achieve better overall performance. This article reviews commonly used dimensionality reduction methods and their classification. Data transformation consists of two steps: feature generation and feature selection. A distinction is made between scalar feature selection and vector methods (wrapper, filter, embedded, and hybrid methods). Each method has its own advantages and disadvantages, which are outlined in the article. The article then describes the application of one of the most effective dimensionality reduction methods, correspondence analysis, to the CSE-CIC-IDS2018 dataset, and evaluates the effectiveness of this method for reducing the dimensionality of that dataset in the detection of computer attacks.
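The abstract names correspondence analysis as the dimensionality reduction method applied to CSE-CIC-IDS2018 but gives no implementation details. As an illustration only, the following is a minimal sketch of classical correspondence analysis in Python with NumPy: it forms the correspondence matrix from a contingency table, standardizes the residuals by the row and column masses, and takes an SVD to obtain principal row coordinates. The toy contingency table (traffic classes versus binned feature values) is an invented assumption, not data from the paper.

```python
import numpy as np

def correspondence_analysis(N, n_components=2):
    """Classical correspondence analysis of a contingency table N.

    Returns the principal coordinates of the rows and the
    principal inertias (squared singular values) of each axis.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                          # correspondence matrix
    r = P.sum(axis=1)                        # row masses
    c = P.sum(axis=0)                        # column masses
    # Standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal row coordinates: F = D_r^{-1/2} U Sigma
    F = (U / np.sqrt(r)[:, None]) * sigma
    return F[:, :n_components], sigma ** 2

# Hypothetical contingency table: rows are traffic classes,
# columns are binned values of a single feature.
N = np.array([[30, 10,  5],
              [10, 25, 15],
              [ 5, 15, 35]])
coords, inertias = correspondence_analysis(N)
```

Because the rows and columns of the standardized residual matrix are centered by construction, its rank is at most min(rows, columns) minus one, so the last principal inertia is numerically zero; the first few axes carry almost all of the chi-square association in the table, which is what makes the method usable for dimensionality reduction.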