{"title":"基于反向传播神经网络的KDD CUP 1999数据集特征约简异常检测","authors":"Bhavin Shah, Bhushan Trivedi","doi":"10.1109/ACCT.2015.131","DOIUrl":null,"url":null,"abstract":"To detect and classify the anomaly in computer network, KDD CUP 1999 dataset is extensively used. This KDD CUP 1999 data set was generated by domain expert at MIT Lincon lab. To reduced number of features of this KDD CUP data set, various feature reduction techniques has been already used. These techniques reduce features from 41 into range of 10 to 22. Usage of such reduced dataset in machine learning algorithm leads to lower complexity, less processing time and high accuracy. Out of the various feature reduction technique available, one of them is Information Gain (IG) which has been already applied for the random forests classifier by Tesfahun et al. Tesfahun's approach reduces time and complexity of model and improves the detection rate for the minority classes in a considerable amount. This work investigates the effectiveness and the feasibility of Tesfahun et al.'s feature reduction technique on Back Propagation Neural Network classifier. We had performed various experiments on KDD CUP 1999 dataset and recorded Accuracy, Precision, Recall and Fscore values. In this work, we had done Basic, N-Fold Validation and Testing comparisons on reduced dataset with full feature dataset. Basic comparison clearly shows that the reduced dataset outer performs on size, time and complexity parameters. Experiments of N-Fold validation show that classifier that uses reduced dataset, have better generalization capacity. During the testing comparison, we found both the datasets are equally compatible. All the three comparisons clearly show that reduced dataset is better or is equally compatible, and does not have any drawback as compared to full dataset. Our experiments shows that usage of such reduced dataset in BPNN can lead to better model in terms of dataset size, complexity, processing time and generalization ability.","PeriodicalId":351783,"journal":{"name":"2015 Fifth International Conference on Advanced Computing & Communication Technologies","volume":"07 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":"{\"title\":\"Reducing Features of KDD CUP 1999 Dataset for Anomaly Detection Using Back Propagation Neural Network\",\"authors\":\"Bhavin Shah, Bhushan Trivedi\",\"doi\":\"10.1109/ACCT.2015.131\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"To detect and classify the anomaly in computer network, KDD CUP 1999 dataset is extensively used. This KDD CUP 1999 data set was generated by domain expert at MIT Lincon lab. To reduced number of features of this KDD CUP data set, various feature reduction techniques has been already used. These techniques reduce features from 41 into range of 10 to 22. Usage of such reduced dataset in machine learning algorithm leads to lower complexity, less processing time and high accuracy. Out of the various feature reduction technique available, one of them is Information Gain (IG) which has been already applied for the random forests classifier by Tesfahun et al. Tesfahun's approach reduces time and complexity of model and improves the detection rate for the minority classes in a considerable amount. This work investigates the effectiveness and the feasibility of Tesfahun et al.'s feature reduction technique on Back Propagation Neural Network classifier. 
We had performed various experiments on KDD CUP 1999 dataset and recorded Accuracy, Precision, Recall and Fscore values. In this work, we had done Basic, N-Fold Validation and Testing comparisons on reduced dataset with full feature dataset. Basic comparison clearly shows that the reduced dataset outer performs on size, time and complexity parameters. Experiments of N-Fold validation show that classifier that uses reduced dataset, have better generalization capacity. During the testing comparison, we found both the datasets are equally compatible. All the three comparisons clearly show that reduced dataset is better or is equally compatible, and does not have any drawback as compared to full dataset. Our experiments shows that usage of such reduced dataset in BPNN can lead to better model in terms of dataset size, complexity, processing time and generalization ability.\",\"PeriodicalId\":351783,\"journal\":{\"name\":\"2015 Fifth International Conference on Advanced Computing & Communication Technologies\",\"volume\":\"07 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-02-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"26\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 Fifth International Conference on Advanced Computing & Communication Technologies\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ACCT.2015.131\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 Fifth International Conference on Advanced Computing & Communication Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ACCT.2015.131","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 26
Abstract
The KDD CUP 1999 dataset is extensively used to detect and classify anomalies in computer networks. It was generated by domain experts at the MIT Lincoln Laboratory. Various feature reduction techniques have already been applied to this dataset, reducing the feature count from 41 to between 10 and 22. Using such a reduced dataset in a machine learning algorithm leads to lower complexity, less processing time, and higher accuracy. One of the available feature reduction techniques is Information Gain (IG), which Tesfahun et al. have already applied to a random forests classifier. Their approach reduces the time and complexity of the model and considerably improves the detection rate for the minority classes. This work investigates the effectiveness and feasibility of Tesfahun et al.'s feature reduction technique on a Back Propagation Neural Network (BPNN) classifier. We performed various experiments on the KDD CUP 1999 dataset and recorded Accuracy, Precision, Recall, and F-score values. We compared the reduced dataset against the full-feature dataset in three ways: a basic comparison, N-fold validation, and testing. The basic comparison clearly shows that the reduced dataset outperforms the full dataset on size, time, and complexity. The N-fold validation experiments show that a classifier trained on the reduced dataset has better generalization capacity. In the testing comparison, we found the two datasets to be equally capable. All three comparisons show that the reduced dataset is better than or equal to the full dataset and has no drawback compared to it. Our experiments show that using such a reduced dataset with a BPNN can lead to a better model in terms of dataset size, complexity, processing time, and generalization ability.
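The paper itself includes no code; the following is a minimal sketch of the information-gain feature-ranking step the abstract describes. Information gain for classification is the mutual information between a feature and the class label, so scikit-learn's mutual_info_classif serves as a stand-in here; the helper name top_k_features, the value of k, and the preprocessing assumptions are illustrative and not taken from the paper or from Tesfahun et al.

# Sketch: rank KDD CUP 1999 features by information gain and keep the top k.
# Assumes X is a fully numeric, preprocessed (n_samples, 41) feature matrix
# and y holds the attack/normal labels; k=22 matches the upper end of the
# 10-to-22 range mentioned in the abstract, not the paper's exact choice.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_k_features(X, y, k=22):
    """Return column indices of the k features with the highest information gain."""
    ig = mutual_info_classif(X, y, discrete_features='auto', random_state=0)
    return np.argsort(ig)[::-1][:k]

# Usage:
# selected = top_k_features(X, y, k=22)
# X_reduced = X[:, selected]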
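Similarly, a rough sketch of the evaluation the abstract reports: N-fold validation of a backpropagation network (a multilayer perceptron trained by backpropagation) on the reduced feature set, recording Accuracy, Precision, Recall, and F-score. The hidden-layer size, fold count, and macro averaging are assumptions made for illustration; the abstract does not specify the paper's actual network configuration.

# Sketch: N-fold validation of a BPNN on the reduced dataset, collecting the
# four metrics named in the abstract. Macro averaging weights all attack
# classes equally, one reasonable choice for the imbalanced KDD labels.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate

def evaluate_bpnn(X_reduced, y, n_folds=10):
    clf = MLPClassifier(hidden_layer_sizes=(30,), max_iter=500, random_state=0)
    scoring = {
        'accuracy': 'accuracy',
        'precision': 'precision_macro',
        'recall': 'recall_macro',
        'fscore': 'f1_macro',
    }
    scores = cross_validate(clf, X_reduced, y, cv=n_folds, scoring=scoring)
    return {name: scores['test_' + name].mean() for name in scoring}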