Predicting Cyberattacks with Destination Port Through Various Input Feature Scenario
Authors: R. Zuech, John T. Hancock, T. Khoshgoftaar
Journal: International Journal of Reliability Quality and Safety Engineering
Published: 2022-06-17
DOI: https://doi.org/10.1142/s0218539322500036
Citations: 0
Abstract
When analyzing cybersecurity datasets with machine learning, researchers commonly need to decide whether to include Destination Port as an input feature. We assess the impact of Destination Port as a predictive feature by building models with three different input feature sets and four combinations of web attacks from the CSE-CIC-IDS2018 dataset. First, we use Destination Port as the sole input feature. Second, we use all CSE-CIC-IDS2018 features except Destination Port. Third, we use all features including Destination Port to train and test the models. With LightGBM and CatBoost classifiers, all three feature sets obtain respectable classification results in detecting web attacks, with Area Under the Receiver Operating Characteristic Curve (AUC) scores exceeding 0.90 in every scenario. We observe the best classification performance when Destination Port is combined with all of the other CSE-CIC-IDS2018 features, although performance is still respectable when Destination Port is the sole input feature. Additionally, we validate that Botnet attacks also yield respectable AUC with Destination Port as the only input feature. This highlights that practitioners must be mindful of whether to include Destination Port as an input feature when it exhibits lopsided label distributions, as we clearly identify in this study. Our brief survey of existing CSE-CIC-IDS2018 literature also found that many studies incorrectly treat Destination Port as a numerical input feature. Destination Port should be treated as a categorical input to machine learning models, because port numbers are identifiers rather than quantities that can meaningfully be used in a model's mathematical equations.
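The categorical-versus-numerical point can be illustrated with a minimal sketch. It is not the authors' pipeline: the flows below are made up rather than drawn from CSE-CIC-IDS2018, the helper names `auc` and `per_port_attack_rate` are hypothetical, and the per-port attack rate stands in for whatever categorical handling a real classifier applies. Scores are also computed on the same toy data they are fitted on, so the numbers only show the direction of the effect.

```python
from collections import defaultdict


def auc(scores, labels):
    """AUC via the rank (Mann-Whitney U) formulation; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_port_attack_rate(ports, labels):
    """Treat each port as a category: map it to its observed attack fraction."""
    counts = defaultdict(lambda: [0, 0])  # port -> [attack count, total count]
    for port, y in zip(ports, labels):
        counts[port][0] += y
        counts[port][1] += 1
    return {port: attacks / total for port, (attacks, total) in counts.items()}


# Toy, made-up flows (NOT from CSE-CIC-IDS2018): attacks concentrate on
# port 80, while benign traffic uses ports both below and above it.
ports = [80, 80, 80, 80, 443, 443, 53, 53, 22, 3389, 8080, 8080]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

rate = per_port_attack_rate(ports, labels)
categorical_scores = [rate[p] for p in ports]  # port used as a category
numeric_scores = [float(p) for p in ports]     # port misused as a magnitude

auc_categorical = auc(categorical_scores, labels)
auc_numeric = auc(numeric_scores, labels)
print(f"categorical: {auc_categorical:.3f}, numeric: {auc_numeric:.3f}")
```

On this toy data the categorical score separates attacks from benign flows well, while ranking flows by raw port number does not, because "larger port" carries no meaning. A tree-based model can partly work around a numeric encoding by splitting on port ranges, so the gap in practice may be smaller than this sketch suggests.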
About the journal:
IJRQSE is a refereed journal focusing on both the theoretical and practical aspects of reliability, quality, and safety in engineering. The journal is intended to cover a broad spectrum of issues in manufacturing, computing, software, aerospace, control, nuclear systems, power systems, communication systems, and electronics. Papers are sought in the theoretical domain as well as in such practical fields as industry and laboratory research. The journal is published quarterly, in March, June, September, and December. It is intended to bridge the gap between theoretical experts and practitioners in the academic, scientific, government, and business communities.