Towards data generation to alleviate privacy concerns for cybersecurity applications

2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC) | Pub Date: 2023-06-01 | DOI: 10.1109/COMPSAC57700.2023.00222

Dhiraj Ganji, Chandranil Chakraborttii



Abstract

Although data sharing is vital for learning and knowledge development, its effectiveness is limited by privacy concerns and stringent regulations. The problem is particularly acute in cybersecurity applications, where client data often comprises confidential and sensitive information. Furthermore, cybersecurity datasets tend to suffer from class imbalance: data related to cyber attacks are rare compared with benign conditions. Hence, machine learning (ML) tasks such as attack detection and classification become challenging. Synthetic tabular data has emerged as a viable alternative that enables data sharing while satisfying regulatory and privacy constraints. In this paper, we present a methodology that uses the Intrusion Detection System (IDS) dataset to generate synthetic tabular representational data from the raw dataset while addressing class imbalance during the data generation process. The methodology incorporates a feature selection process that identifies the features most important for accurate data generation, and it demonstrates comparable performance with popular ML techniques on the anomaly detection task. The similarity between the original and generated datasets is evaluated using two metrics, a distribution metric and a data reduction metric, achieving a similarity score of up to 0.97 on the data reduction metric and outperforming a baseline approach that uses all input features by up to 11%.
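The abstract does not say how class imbalance is handled or how the distribution metric is defined, so the following is only an illustrative sketch and not the authors' method. It assumes random oversampling as the balancing step and histogram overlap (one minus the total variation distance) as a stand-in similarity score; the function names `oversample_minority` and `histogram_overlap` are hypothetical.

```python
import random
from collections import Counter


def oversample_minority(rows, label_key, seed=0):
    """Duplicate random rows of rarer classes until every class matches the largest one.

    Assumption: plain random oversampling stands in for whatever balancing
    step the paper actually uses.
    """
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


def histogram_overlap(original, synthetic, bins=10):
    """Return 1 - total variation distance between two binned numeric samples.

    Assumption: a stand-in for the unspecified "distribution metric".
    A score of 1.0 means the two histograms are identical.
    """
    lo = min(min(original), min(synthetic))
    hi = max(max(original), max(synthetic))
    width = (hi - lo) / bins or 1.0  # avoid a zero bin width for constant data

    def hist(values):
        counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
        return [counts.get(b, 0) / len(values) for b in range(bins)]

    p, q = hist(original), hist(synthetic)
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

In this sketch, a generated feature column would be compared against the same column of the original data with `histogram_overlap`, and scores could then be averaged over the selected features.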


