Towards data generation to alleviate privacy concerns for cybersecurity applications
Dhiraj Ganji, Chandranil Chakraborttii
2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC), 2023-06-01
DOI: 10.1109/COMPSAC57700.2023.00222 (https://doi.org/10.1109/COMPSAC57700.2023.00222)
Abstract
Data sharing is vital for learning and knowledge development, but privacy concerns and stringent regulations limit its effectiveness. The problem is particularly acute in cybersecurity applications, where client data often contains confidential and sensitive information. Cybersecurity datasets also tend to suffer from class imbalance: data related to cyber attacks are rare compared with benign conditions. As a result, machine learning (ML) tasks such as attack detection and classification become challenging. Synthetic tabular data has emerged as a viable alternative that enables data sharing while satisfying regulatory and privacy constraints. In this paper, we present a methodology that uses the Intrusion Detection System (IDS) dataset to generate representative synthetic tabular data from the raw dataset while addressing class imbalance during data generation. The methodology incorporates a feature selection process that identifies the features most important for accurate data generation, and it demonstrates comparable performance with popular ML techniques on the anomaly detection task. The similarity between the original and generated datasets is evaluated using two metrics, a distribution metric and a data reduction metric. The approach achieves a similarity score of up to 0.97 on the data reduction metric, outperforming a baseline that uses all input features by up to 11%.
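The abstract names a distribution metric for comparing original and generated data but does not define it. As a rough illustration only, the sketch below scores similarity as one minus the two-sample Kolmogorov-Smirnov statistic, averaged over shared numeric columns. The choice of the KS statistic, the function names `ks_statistic` and `distribution_similarity`, and the column names are all assumptions made for this example, not the paper's method.

```python
from bisect import bisect_right
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in sorted(set(a_sorted) | set(b_sorted)):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def distribution_similarity(original, synthetic):
    """Mean of (1 - KS) over columns present in both tables (assumed metric, not the paper's).

    Each table is a dict mapping a column name to a list of numeric values.
    A score of 1.0 means every shared column has identical empirical distributions.
    """
    columns = sorted(original.keys() & synthetic.keys())
    if not columns:
        raise ValueError("the two tables share no columns")
    scores = [1.0 - ks_statistic(original[c], synthetic[c]) for c in columns]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical numeric features standing in for IDS flow attributes.
    original = {"duration": [rng.gauss(10, 2) for _ in range(500)]}
    close = {"duration": [rng.gauss(10, 2) for _ in range(500)]}
    shifted = {"duration": [rng.gauss(20, 2) for _ in range(500)]}
    print("close:  ", round(distribution_similarity(original, close), 3))
    print("shifted:", round(distribution_similarity(original, shifted), 3))
```

A per-column score like this ignores correlations between features, so it would typically be paired with a second measure, which is consistent with the abstract reporting two metrics.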