{"title":"Optimizing Intrusion Detection Systems in Three Phases on the CSE-CIC-IDS-2018 Dataset","authors":"Surasit Songma, Theera Sathuphan, Thanakorn Pamutha","doi":"10.3390/computers12120245","DOIUrl":null,"url":null,"abstract":"This article examines intrusion detection systems in depth using the CSE-CIC-IDS-2018 dataset. The investigation is divided into three stages: to begin, data cleaning, exploratory data analysis, and data normalization procedures (min-max and Z-score) are used to prepare data for use with various classifiers; second, in order to improve processing speed and reduce model complexity, a combination of principal component analysis (PCA) and random forest (RF) is used to reduce non-significant features by comparing them to the full dataset; finally, machine learning methods (XGBoost, CART, DT, KNN, MLP, RF, LR, and Bayes) are applied to specific features and preprocessing procedures, with the XGBoost, DT, and RF models outperforming the others in terms of both ROC values and CPU runtime. The evaluation concludes with the discovery of an optimal set, which includes PCA and RF feature selection.","PeriodicalId":46292,"journal":{"name":"Computers","volume":"62 37","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2023-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/computers12120245","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0
Abstract
This article examines intrusion detection systems in depth using the CSE-CIC-IDS-2018 dataset. The investigation is divided into three stages: to begin, data cleaning, exploratory data analysis, and data normalization procedures (min-max and Z-score) are used to prepare data for use with various classifiers; second, in order to improve processing speed and reduce model complexity, a combination of principal component analysis (PCA) and random forest (RF) is used to reduce non-significant features by comparing them to the full dataset; finally, machine learning methods (XGBoost, CART, DT, KNN, MLP, RF, LR, and Bayes) are applied to specific features and preprocessing procedures, with the XGBoost, DT, and RF models outperforming the others in terms of both ROC values and CPU runtime. The evaluation concludes with the discovery of an optimal set, which includes PCA and RF feature selection.