Combating imbalance in network intrusion datasets

2006 IEEE International Conference on Granular Computing. Pub Date: 2006-05-10. DOI: 10.1109/GRC.2006.1635905

David A. Cieslak, N. Chawla, A. Striegel


Citations: 242

Abstract

One approach to combating network intrusion is to develop systems that apply machine learning and data mining techniques. Many intrusion detection systems (IDS) suffer from a high rate of false alarms and missed intrusions. We aim to improve the intrusion detection rate at a reduced false positive rate. The focus of this paper is rule learning, using RIPPER as the underlying rule classifier, on highly imbalanced intrusion datasets, with the objective of improving the true positive rate (intrusions) without significantly increasing the false positives. To counter imbalance in the data, we implement a combination of oversampling (both by replication and by synthetic generation) and undersampling techniques. We also propose a clustering-based methodology for oversampling by generating synthetic instances. We evaluate our approaches on two intrusion datasets (one destination based, one actual-packets based) constructed from actual Notre Dame traffic, giving a flavor of real-world data with its idiosyncrasies. Using ROC analysis, we show that oversampling by synthetic generation of the minority (intrusion) class outperforms both oversampling by replication and RIPPER's loss ratio method. Additionally, we establish that our clustering-based approach is more suitable for detecting intrusions and provides additional improvement over synthetic generation of instances alone.
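The three resampling ideas named in the abstract (undersampling the majority class, oversampling the minority class by replication, and oversampling by synthetic generation) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not state how synthetic instances are produced, so the interpolation between a minority row and one of its nearest minority neighbours used here is an assumption, as are all function names, the `keep_fraction` and `k` parameters, and the toy data. RIPPER, the clustering-based variant, and the ROC evaluation are not shown.

```python
import random


def undersample(majority, keep_fraction, rng):
    """Randomly keep a fraction of the majority (normal traffic) rows."""
    n_keep = max(1, round(len(majority) * keep_fraction))
    return rng.sample(majority, n_keep)


def oversample_by_replication(minority, n_new, rng):
    """Add exact copies of randomly chosen minority (intrusion) rows."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def oversample_synthetic(minority, n_new, k, rng):
    """Create new minority rows by interpolation (assumed scheme).

    Each new row lies on the segment between a randomly chosen minority
    row and one of its k nearest minority neighbours. Requires at least
    two minority rows.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: squared_distance(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical two-feature toy data: 200 normal rows, 10 intrusion rows.
    normal = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    attacks = [[rng.gauss(4, 0.5), rng.gauss(4, 0.5)] for _ in range(10)]

    kept = undersample(normal, keep_fraction=0.5, rng=rng)
    copies = oversample_by_replication(attacks, n_new=20, rng=rng)
    synth = oversample_synthetic(attacks, n_new=20, k=3, rng=rng)
    print(len(kept), len(attacks) + len(copies), len(attacks) + len(synth))
```

Replication only repeats existing intrusion rows, whereas the synthetic variant produces new rows inside the region spanned by existing ones, which is one plausible reason a rule learner might generalize differently under the two schemes.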



