Effects of feature selection and normalization on network intrusion detection

Mubarak Albarka Umar, Zhanfang Chen, Khaled Shuaib, Yan Liu
Data Science and Management, Volume 8, Issue 1, Pages 23-39. Published 2024-08-13. DOI: 10.1016/j.dsm.2024.08.001. https://www.sciencedirect.com/science/article/pii/S2666764924000390

Abstract

The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches have led to the use of artificial intelligence (AI) techniques, such as machine learning (ML) and deep learning (DL), to build more efficient and reliable intrusion detection systems (IDSs). However, the advent of larger IDS datasets has negatively impacted the performance and computational complexity of AI-based IDSs. Many researchers have used data preprocessing techniques such as feature selection and normalization to overcome these issues. While most of these researchers reported the success of these preprocessing techniques at a shallow level, very few studies have examined their effects on a wider scale. Furthermore, the performance of an IDS model depends not only on the preprocessing techniques used but also on the dataset and the ML/DL algorithm, a point to which most existing studies give little emphasis. Thus, this study provides an in-depth analysis of the effects of feature selection and normalization on IDS models built using three IDS datasets (NSL-KDD, UNSW-NB15, and CSE–CIC–IDS2018) and various AI algorithms. A wrapper-based approach, which tends to give superior performance, was used for feature selection, and min-max scaling was used for normalization. Numerous IDS models were implemented using the full and feature-selected copies of the datasets, with and without normalization. The models were evaluated using popular evaluation metrics in IDS modeling, and intra- and inter-model comparisons were performed between the models and with state-of-the-art works. Random forest (RF) models performed better on the NSL-KDD and UNSW-NB15 datasets, with accuracies of 99.86% and 96.01%, respectively, whereas an artificial neural network (ANN) achieved the best accuracy of 95.43% on the CSE–CIC–IDS2018 dataset. The RF models also achieved excellent performance compared to recent works. The results show that normalization and feature selection positively affect IDS modeling. Furthermore, while feature selection benefits simpler algorithms (such as RF), normalization is more useful for complex algorithms such as ANNs and deep neural networks (DNNs), and algorithms such as naive Bayes (NB) are unsuitable for IDS modeling. The study also found that the UNSW-NB15 and CSE–CIC–IDS2018 datasets are more complex and more suitable for building and evaluating modern-day IDSs than the NSL-KDD dataset. Our findings suggest that prioritizing robust algorithms like RF, alongside complex models such as ANNs and DNNs, can significantly enhance IDS performance. These insights provide valuable guidance for managers to develop more effective security measures by focusing on high detection rates and low false alert rates.
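The two preprocessing steps named in the abstract can be illustrated with a minimal sketch. Min-max normalization rescales each feature to [0, 1] using (x - min) / (max - min). A wrapper method scores candidate feature subsets by training and evaluating a model on them. The abstract does not say which search strategy or scoring model the authors used, so the greedy forward search and the nearest-centroid scorer below are assumptions chosen only to keep the example dependency-free. The names `min_max_normalize`, `centroid_accuracy`, and `forward_select` are hypothetical.

```python
from typing import Callable, List, Sequence

Row = Sequence[float]
Scorer = Callable[[Sequence[Row], Sequence[int], List[int]], float]


def min_max_normalize(rows: Sequence[Row]) -> List[List[float]]:
    """Rescale every column to [0, 1] via (x - min) / (max - min)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def centroid_accuracy(rows: Sequence[Row], labels: Sequence[int],
                      features: List[int]) -> float:
    """Assumed toy scorer: accuracy of a nearest-centroid classifier that
    sees only `features`. It scores on the same rows it was fitted on,
    which a real study would replace with held-out data."""
    classes = sorted(set(labels))
    centroids = {}
    for cls in classes:
        members = [row for row, label in zip(rows, labels) if label == cls]
        centroids[cls] = [sum(row[f] for row in members) / len(members)
                          for f in features]
    correct = 0
    for row, label in zip(rows, labels):
        # Squared Euclidean distance to each class centroid; ties go to
        # the lowest class label because `classes` is sorted.
        predicted = min(
            classes,
            key=lambda cls: sum((row[f] - m) ** 2
                                for f, m in zip(features, centroids[cls])),
        )
        correct += predicted == label
    return correct / len(rows)


def forward_select(rows: Sequence[Row], labels: Sequence[int],
                   score: Scorer) -> List[int]:
    """Greedy forward wrapper search (an assumed strategy): repeatedly add
    the feature that raises the score most, and stop when none helps."""
    selected: List[int] = []
    remaining = list(range(len(rows[0])))
    best = 0.0
    while remaining:
        trials = [(score(rows, labels, selected + [f]), f) for f in remaining]
        # Prefer the higher score; break ties by the lower feature index.
        top_score, top_feature = max(trials, key=lambda t: (t[0], -t[1]))
        if top_score <= best:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected


if __name__ == "__main__":
    # Made-up data: column 0 separates the classes, column 1 is large-scale noise.
    data = [[0.1, 500], [0.2, 100], [0.15, 300],
            [0.9, 400], [0.8, 200], [0.85, 300]]
    target = [0, 0, 0, 1, 1, 1]
    scaled = min_max_normalize(data)
    print(forward_select(scaled, target, centroid_accuracy))
```

In practice the minimum and maximum would be computed on the training split only and then reused for the test split, and the scorer would be the actual IDS model evaluated with cross-validation or a validation set. The sketch skips both to stay short.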