Three-stage data generation algorithm for multiclass network intrusion detection with highly imbalanced dataset

Kwok Tai Chui, Brij B. Gupta, Priyanka Chaurasia, Varsha Arya, Ammar Almomani, Wadee Alhalabi
International Journal of Intelligent Networks, Volume 4, Pages 202-210, 2023. DOI: 10.1016/j.ijin.2023.08.001. URL: https://www.sciencedirect.com/science/article/pii/S2666603023000209
Citations: 1

Abstract

The Internet plays a crucial role in our daily routines, and ensuring cybersecurity for Internet users provides a safe online environment. Automatic network intrusion detection (NID) using machine learning algorithms has recently received increased attention. Because datasets are highly imbalanced across different types of attacks, an NID model is prone to bias towards the classes with more training samples. A key challenge in generating additional training data for minority classes is that too little data is generated. This study addresses the challenge by extending data generation capability with a three-stage data generation algorithm that uses the synthetic minority over-sampling technique (SMOTE), a generative adversarial network (GAN), and a variational autoencoder (VAE). A convolutional neural network is employed to extract representative features from the data, which are then fed into a support vector machine with a customised kernel function. An ablation study evaluated the effectiveness of the three-stage data generation, the feature extraction, and the customised kernel. This was followed by a performance comparison between our study and existing studies. The findings revealed that the proposed NID model achieved an accuracy of 91.9%–96.2% on the four benchmark datasets. In addition, it outperformed existing methods, such as a GAN-based deep neural network, a conditional Wasserstein GAN-based stacked autoencoder, a SMOTE-based random forest, and a variational autoencoder-based deep neural network, by 1.51%–28.4%.
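
The abstract names the synthetic minority over-sampling technique as one component of the three-stage generator. As a rough illustration of that idea only, the sketch below interpolates between a minority-class sample and one of its nearest minority-class neighbours. It is not the authors' implementation: the function name `smote_like_oversample`, the neighbour count `k`, the seed, and the toy two-feature data are all assumptions made for this example, and the GAN and variational autoencoder stages are omitted.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment from base towards neighbour.
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two made-up features (assumed data).
minority_samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_samples = smote_like_oversample(minority_samples, n_new=6)
```

Because every synthetic point lies on a segment between two existing minority samples, the generated data stays inside the region the minority class already occupies. That limitation is one plausible reason to combine interpolation with generative models, as the abstract describes.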
