Enhanced intrusion detection model based on principal component analysis and variable ensemble machine learning algorithm

Intelligent Systems with Applications, vol. 24, Article 200442. Pub Date: 2024-09-21. DOI: 10.1016/j.iswa.2024.200442

Ayuba John, Ismail Fauzi Bin Isnin, Syed Hamid Hussain Madni, Farkhana Binti Muchtar


Citations: 0

Abstract

An intrusion detection system (IDS) model identifies intruders in a network and takes predefined action to keep data transit safe, which makes it useful for securing both simple and advanced network systems. Many IDS models nevertheless suffer from low detection accuracy and high false alarm rates, which can be caused by the excessive dimensionality and class imbalance of the network traffic datasets used to build them. Principal Component Analysis (PCA) has proven to be a helpful feature selection technique for dimensionality reduction. However, because it is a linear transformation, it struggles to capture non-linear relationships between features in network traffic datasets. This paper proposes a variable ensemble machine learning method to address this problem and obtain a low-variance model with high accuracy and a low false alarm rate. First, PCA is combined with the AdaBoost ensemble algorithm, which acts as stagewise additive modelling and compensates for PCA's deficiency in feature selection on network traffic by minimizing the exponential loss function. Second, PCA is used for feature selection together with a LogitBoost classifier, which supports multiclass classification and acts as additive tree regression, compensating for PCA's weakness by minimizing the logistic loss to produce an optimal classifier output. Finally, RandomForest, which uses the bagging approach and therefore has low variance, is applied to curb overfitting.

The IDS models built with the proposed methods were evaluated on the WSN-DS, NSL-KDD, and UNSW-NB15 datasets. PCA with AdaBoost achieved an accuracy of 92.3% on WSN-DS, 89.0% on NSL-KDD, and 67.9% on UNSW-NB15, the lowest score reported. PCA with RandomForest outperformed the other combinations, scoring 100% accuracy on all three datasets. PCA with Bagging achieved 99.8% on WSN-DS, 100% on NSL-KDD, and 93.4% on UNSW-NB15. By comparison, PCA with LogitBoost achieved 98.9% on WSN-DS, 100% on NSL-KDD, and 88.7% on UNSW-NB15.
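The PCA-plus-ensemble combinations described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, uses synthetic imbalanced data in place of WSN-DS, NSL-KDD, or UNSW-NB15, and picks the number of components and estimators arbitrarily. scikit-learn has no LogitBoost classifier, so `GradientBoostingClassifier` (additive trees fitted to a logistic loss) is swapped in as a related but different technique. The function name `evaluate_pca_ensembles` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_pca_ensembles(n_components=10, seed=0):
    """Fit PCA + ensemble pipelines on synthetic data; return test accuracy per model."""
    # Assumption: synthetic, imbalanced 3-class data stands in for network traffic records.
    X, y = make_classification(
        n_samples=600,
        n_features=40,
        n_informative=12,
        n_classes=3,
        weights=[0.7, 0.2, 0.1],
        random_state=seed,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )

    # Assumption: estimator counts are illustrative, not taken from the paper.
    classifiers = {
        "adaboost": AdaBoostClassifier(n_estimators=50, random_state=seed),
        # Stand-in for LogitBoost (not available in scikit-learn).
        "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "bagging": BaggingClassifier(n_estimators=50, random_state=seed),
    }

    scores = {}
    for name, clf in classifiers.items():
        # Standardize, project onto the leading principal components, then classify.
        model = make_pipeline(StandardScaler(), PCA(n_components=n_components), clf)
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


if __name__ == "__main__":
    for name, acc in evaluate_pca_ensembles().items():
        print(f"{name}: {acc:.3f}")
```

Wrapping the scaler, PCA, and classifier in one pipeline means the projection is fitted on the training split only, so the held-out accuracy is not inflated by information from the test records. Accuracies on this synthetic data say nothing about the figures reported above.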


Source journal

Intelligent Systems with Applications

CiteScore

5.60

Self-citation rate

0.00%

Number of publications