Using CGAN to Deal with Class Imbalance and Small Sample Size in Cybersecurity Problems

2021 18th International Conference on Privacy, Security and Trust (PST) Pub Date : 2021-12-13 DOI:10.1109/PST52912.2021.9647807

Ehsan Nazari, Paula Branco, Guy-Vincent Jourdan


Citations: 2

Abstract

Predictive modelling in cybersecurity domains usually involves dealing with complex settings. The class imbalance problem is a well-known challenge typically present in the cybersecurity domain. For instance, in a real-world intrusion detection scenario, the number of attacks is expected to be a very small percentage of the normal cases. Moreover, in these applications, the number of available labelled examples is also small due to the complexity and cost of the labelling process: teams of domain experts need to be involved, which makes the process expensive, time-consuming and prone to errors. Addressing these problems is critical to the success of predictive modelling in cybersecurity applications. In this paper we tackle class imbalance and small sample size through a CGAN-based up-sampling procedure. We carry out an extensive set of experiments that show the positive impact of applying this solution to both problems. We build and freely provide to the research community a large data repository containing 114 binary datasets, based on real-world cybersecurity problems, generated with diverse levels of imbalance and sample size. Our experiments show a clear advantage of the CGAN-based up-sampling method, especially in situations where the sample size is small and there is a large imbalance between the problem classes. In the most critical scenarios, associated with extreme rarity and very small sample size, an impressive performance boost is achieved. We also explore the behaviour of this approach when these problems are less marked and find that, while CGAN-based up-sampling is not able to further improve the minority class performance, it also has no negative impact. Thus, it is a safe-to-use solution in these scenarios as well.
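The abstract describes datasets generated with different levels of imbalance and sample size, and an up-sampling step that adds synthetic minority examples. The sketch below illustrates only the bookkeeping around that idea and does not implement a CGAN. The function names, the 0/1 label encoding, the example pool sizes and the choice to up-sample until the classes are balanced 1:1 are all assumptions made for illustration, not the authors' procedure.

```python
import random
from collections import Counter


def subsample_binary(labels, n_total, minority_fraction, seed=0):
    """Return indices of a subsample with a chosen size and minority share.

    Assumption: labels are 0 (majority, e.g. normal) and 1 (minority, e.g. attack).
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Keep at least one minority example so the rare class never vanishes.
    n_min = max(1, round(n_total * minority_fraction))
    n_maj = n_total - n_min
    if n_min > len(minority) or n_maj > len(majority):
        raise ValueError("not enough examples for the requested subsample")
    return sorted(rng.sample(minority, n_min) + rng.sample(majority, n_maj))


def synthetic_needed(labels):
    """Count the minority examples a generator would have to add for a 1:1 balance."""
    counts = Counter(labels)
    return max(0, counts[0] - counts[1])


# Hypothetical pool: 10,000 normal records and 500 attacks.
pool = [0] * 10_000 + [1] * 500
idx = subsample_binary(pool, n_total=200, minority_fraction=0.05, seed=42)
sub = [pool[i] for i in idx]
print(Counter(sub))
print(synthetic_needed(sub))
```

In a pipeline of this kind, a conditional generator trained on the subsample would be asked for `synthetic_needed(sub)` examples conditioned on the minority label, and a classifier would then be trained on the combined real and synthetic data.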
