Effects of feature selection and normalization on network intrusion detection

Mubarak Albarka Umar, Zhanfang Chen, Khaled Shuaib, Yan Liu
Data Science and Management, vol. 8, no. 1, pp. 23-39. Published 2024-08-13. DOI: 10.1016/j.dsm.2024.08.001
https://www.sciencedirect.com/science/article/pii/S2666764924000390
Citations: 0

Abstract

The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches have led to the use of artificial intelligence (AI) techniques, such as machine learning (ML) and deep learning (DL), to build more efficient and reliable intrusion detection systems (IDSs). However, the advent of larger IDS datasets has negatively impacted the performance and computational complexity of AI-based IDSs. Many researchers have used data preprocessing techniques such as feature selection and normalization to overcome these issues. While most of these researchers reported the success of these preprocessing techniques at a shallow level, very few studies have examined their effects on a wider scale. Furthermore, the performance of an IDS model depends not only on the preprocessing techniques used but also on the dataset and the ML/DL algorithm, a point to which most existing studies give little emphasis. Thus, this study provides an in-depth analysis of the effects of feature selection and normalization on IDS models built using three IDS datasets (NSL-KDD, UNSW-NB15, and CSE–CIC–IDS2018) and various AI algorithms. A wrapper-based approach, which tends to give superior performance, was used for feature selection, and min-max scaling was used for normalization. Numerous IDS models were implemented using the full and feature-selected copies of the datasets, with and without normalization. The models were evaluated using popular evaluation metrics in IDS modeling, and intra- and inter-model comparisons were performed between the models and with state-of-the-art works. Random forest (RF) models performed best on the NSL-KDD and UNSW-NB15 datasets, with accuracies of 99.86% and 96.01%, respectively, whereas an artificial neural network (ANN) achieved the best accuracy of 95.43% on the CSE–CIC–IDS2018 dataset. The RF models also performed excellently compared with recent works.
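As an illustration of the two preprocessing steps named above, the following is a minimal sketch and not the authors' implementation. It assumes min-max scaling to [0, 1] fitted on training data only, and a greedy forward wrapper that adds one feature at a time while a validation score improves. The function names are hypothetical, and the nearest-centroid scorer is a simple stand-in for the RF, ANN, and other models used in the study.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def min_max_fit(rows: Matrix) -> List[Tuple[float, float]]:
    """Return the (min, max) of every column, computed on the given rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def min_max_transform(rows: Matrix, ranges: Sequence[Tuple[float, float]]) -> Matrix:
    """Scale each column with x' = (x - min) / (max - min).

    Constant columns (max == min) are mapped to 0.0 to avoid division by zero.
    """
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, ranges)]
        for row in rows
    ]


def nearest_centroid_accuracy(X_tr, y_tr, X_va, y_va, feats) -> float:
    """Stand-in model (an assumption, not the paper's classifier).

    Builds one centroid per class from the selected features and returns
    the accuracy of nearest-centroid predictions on the validation rows.
    """
    centroids = {}
    for label in sorted(set(y_tr)):
        members = [x for x, y in zip(X_tr, y_tr) if y == label]
        centroids[label] = [sum(m[f] for m in members) / len(members) for f in feats]
    correct = 0
    for x, y in zip(X_va, y_va):
        pred = min(
            centroids,
            key=lambda lab: sum((x[f] - c) ** 2 for f, c in zip(feats, centroids[lab])),
        )
        correct += pred == y
    return correct / len(y_va)


def forward_select(X_tr, y_tr, X_va, y_va, score=nearest_centroid_accuracy):
    """Greedy wrapper selection: keep adding the feature that raises the
    validation score the most, and stop when no candidate improves it."""
    remaining = list(range(len(X_tr[0])))
    selected, best = [], 0.0
    while remaining:
        scored = [(score(X_tr, y_tr, X_va, y_va, selected + [f]), f) for f in remaining]
        # Prefer the higher score; break ties in favor of the lower feature index.
        top_score, top_f = max(scored, key=lambda t: (t[0], -t[1]))
        if top_score <= best:
            break
        selected.append(top_f)
        remaining.remove(top_f)
        best = top_score
    return selected, best
```

In this sketch the scaling ranges are fitted on the training rows and then reused for the validation rows, so that no information from held-out data leaks into preprocessing. Whether the study followed the same split discipline is not stated in the abstract.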
The results show that normalization and feature selection positively affect IDS modeling. Furthermore, while feature selection benefits simpler algorithms (such as RF), normalization is more useful for complex algorithms such as ANNs and deep neural networks (DNNs), and algorithms such as naive Bayes (NB) are unsuitable for IDS modeling. The study also found that the UNSW-NB15 and CSE–CIC–IDS2018 datasets are more complex and more suitable for building and evaluating modern-day IDSs than the NSL-KDD dataset. Our findings suggest that prioritizing robust algorithms like RF, alongside complex models such as ANNs and DNNs, can significantly enhance IDS performance. These insights provide valuable guidance for managers to develop more effective security measures by focusing on high detection rates and low false alert rates.
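The abstract refers to accuracy, detection rate, and false alert rate without defining them. The sketch below uses the conventional binary definitions, which are an assumption about the paper's exact metric set: detection rate as TP / (TP + FN) and false alarm rate as FP / (FP + TN), with attack traffic as the positive class. The function name and the label encoding (1 for attack, 0 for normal) are hypothetical.

```python
from typing import Dict, Sequence


def ids_metrics(y_true: Sequence[int], y_pred: Sequence[int], attack: int = 1) -> Dict[str, float]:
    """Compute accuracy, detection rate, and false alarm rate for binary IDS labels.

    Any label other than `attack` is treated as normal traffic.
    A ratio with a zero denominator is reported as 0.0.
    """
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == attack and p == attack:
            tp += 1  # attack correctly flagged
        elif t == attack:
            fn += 1  # attack missed
        elif p == attack:
            fp += 1  # normal traffic flagged as attack
        else:
            tn += 1  # normal traffic correctly passed

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "detection_rate": ratio(tp, tp + fn),
        "false_alarm_rate": ratio(fp, fp + tn),
    }
```

Reporting detection rate and false alarm rate alongside accuracy matters on IDS datasets because the classes are often imbalanced, so a high accuracy alone can hide missed attacks.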