{"title":"基于数据清洗的入侵检测系统关联特征选择","authors":"S. Suthaharan, T. Panchagnula","doi":"10.1109/SECON.2012.6196965","DOIUrl":null,"url":null,"abstract":"Labeled datasets play a major role in the process of validating and evaluating machine learning techniques in intrusion detection systems. In order to obtain good accuracy in the evaluation, very large datasets should be considered. Intrusion traffic and normal traffic are in general dependent on a large number of network characteristics called features. However not all of these features contribute to the traffic characteristics. Therefore, eliminating the non-contributing features from the datasets, to facilitate speed and accuracy to the evaluation of machine learning techniques, becomes an important requirement. In this paper we suggest an approach which analyzes the intrusion datasets, evaluates the features for its relevance to a specific attack, determines the level of contribution of feature, and eliminates it from the dataset automatically. We adopt the Rough Set Theory (RST) based approach and select relevance features using multidimensional scatter-plot automatically. A pair-wise feature selection process is adopted to simplify. In our previous research we used KDD'99 dataset and validated the RST based approach. There are lots of redundant data entries in KDD'99 and thus the machine learning techniques are biased towards most occurring events. This property leads the algorithms to ignore less frequent events which can be more harmful than most occurring events. False positives are another important drawback in KDD'99 dataset. In this paper, we adopt NSL-KDD dataset (an improved version of KDD'99 dataset) and validate the automated RST based approach. The approach presented in this paper leads to a selection of most relevance features and we expect that the intrusion detection research using KDD'99-based datasets will benefit from the good understanding of network features and their influences to attacks.","PeriodicalId":187091,"journal":{"name":"2012 Proceedings of IEEE Southeastcon","volume":"93 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":"{\"title\":\"Relevance feature selection with data cleaning for intrusion detection system\",\"authors\":\"S. Suthaharan, T. Panchagnula\",\"doi\":\"10.1109/SECON.2012.6196965\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Labeled datasets play a major role in the process of validating and evaluating machine learning techniques in intrusion detection systems. In order to obtain good accuracy in the evaluation, very large datasets should be considered. Intrusion traffic and normal traffic are in general dependent on a large number of network characteristics called features. However not all of these features contribute to the traffic characteristics. Therefore, eliminating the non-contributing features from the datasets, to facilitate speed and accuracy to the evaluation of machine learning techniques, becomes an important requirement. In this paper we suggest an approach which analyzes the intrusion datasets, evaluates the features for its relevance to a specific attack, determines the level of contribution of feature, and eliminates it from the dataset automatically. We adopt the Rough Set Theory (RST) based approach and select relevance features using multidimensional scatter-plot automatically. A pair-wise feature selection process is adopted to simplify. In our previous research we used KDD'99 dataset and validated the RST based approach. There are lots of redundant data entries in KDD'99 and thus the machine learning techniques are biased towards most occurring events. This property leads the algorithms to ignore less frequent events which can be more harmful than most occurring events. False positives are another important drawback in KDD'99 dataset. In this paper, we adopt NSL-KDD dataset (an improved version of KDD'99 dataset) and validate the automated RST based approach. The approach presented in this paper leads to a selection of most relevance features and we expect that the intrusion detection research using KDD'99-based datasets will benefit from the good understanding of network features and their influences to attacks.\",\"PeriodicalId\":187091,\"journal\":{\"name\":\"2012 Proceedings of IEEE Southeastcon\",\"volume\":\"93 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-03-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"32\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 Proceedings of IEEE Southeastcon\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SECON.2012.6196965\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 Proceedings of IEEE Southeastcon","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SECON.2012.6196965","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Relevance feature selection with data cleaning for intrusion detection system
Labeled datasets play a major role in the process of validating and evaluating machine learning techniques in intrusion detection systems. In order to obtain good accuracy in the evaluation, very large datasets should be considered. Intrusion traffic and normal traffic are in general dependent on a large number of network characteristics called features. However not all of these features contribute to the traffic characteristics. Therefore, eliminating the non-contributing features from the datasets, to facilitate speed and accuracy to the evaluation of machine learning techniques, becomes an important requirement. In this paper we suggest an approach which analyzes the intrusion datasets, evaluates the features for its relevance to a specific attack, determines the level of contribution of feature, and eliminates it from the dataset automatically. We adopt the Rough Set Theory (RST) based approach and select relevance features using multidimensional scatter-plot automatically. A pair-wise feature selection process is adopted to simplify. In our previous research we used KDD'99 dataset and validated the RST based approach. There are lots of redundant data entries in KDD'99 and thus the machine learning techniques are biased towards most occurring events. This property leads the algorithms to ignore less frequent events which can be more harmful than most occurring events. False positives are another important drawback in KDD'99 dataset. In this paper, we adopt NSL-KDD dataset (an improved version of KDD'99 dataset) and validate the automated RST based approach. The approach presented in this paper leads to a selection of most relevance features and we expect that the intrusion detection research using KDD'99-based datasets will benefit from the good understanding of network features and their influences to attacks.