{"title":"Combating imbalance in network intrusion datasets","authors":"David A. Cieslak, N. Chawla, A. Striegel","doi":"10.1109/GRC.2006.1635905","DOIUrl":null,"url":null,"abstract":"An approach to combating network intrusion is the development of systems applying machine learning and data min- ing techniques. Many IDS (Intrusion Detection Systems) suffer from a high rate of false alarms and missed intrusions. We want to be able to improve the intrusion detection rate at a reduced false positive rate. The focus of this paper is rule-learning, using RIPPER, on highly imbalanced intrusion datasets with an objective to improve the true positive rate (intrusions) without significantly increasing the false positives. We use RIPPER as the underlying rule classifier. To counter imbalance in data, we implement a combination of oversampling (both by replication and synthetic generation) and undersampling techniques. We also propose a clustering based methodology for oversampling by generating synthetic instances. We evaluate our approaches on two intrusion datasets — destination and actual packets based — constructed from actual Notre Dame traffic, giving a flavor of real-world data with its idiosyncrasies. Using ROC analysis, we show that oversampling by synthetic generation of minority (intrusion) class outperforms oversampling by replication and RIPPER's loss ratio method. Additionally, we establish that our clustering based approach is more suitable for the detecting intrusions and is able to provide additional improvement over just synthetic generation of instances.","PeriodicalId":400997,"journal":{"name":"2006 IEEE International Conference on Granular Computing","volume":"72 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"242","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2006 IEEE International Conference on Granular Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/GRC.2006.1635905","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 242
Abstract
An approach to combating network intrusion is the development of systems applying machine learning and data min- ing techniques. Many IDS (Intrusion Detection Systems) suffer from a high rate of false alarms and missed intrusions. We want to be able to improve the intrusion detection rate at a reduced false positive rate. The focus of this paper is rule-learning, using RIPPER, on highly imbalanced intrusion datasets with an objective to improve the true positive rate (intrusions) without significantly increasing the false positives. We use RIPPER as the underlying rule classifier. To counter imbalance in data, we implement a combination of oversampling (both by replication and synthetic generation) and undersampling techniques. We also propose a clustering based methodology for oversampling by generating synthetic instances. We evaluate our approaches on two intrusion datasets — destination and actual packets based — constructed from actual Notre Dame traffic, giving a flavor of real-world data with its idiosyncrasies. Using ROC analysis, we show that oversampling by synthetic generation of minority (intrusion) class outperforms oversampling by replication and RIPPER's loss ratio method. Additionally, we establish that our clustering based approach is more suitable for the detecting intrusions and is able to provide additional improvement over just synthetic generation of instances.