Md. Al Mehedi Hasan, M. Nasser, B. Pal, Shamim Ahmad
Support Vector Machine and Random Forest Modeling for Intrusion Detection System (IDS)
Journal of Intelligent Learning Systems and Applications, vol. 6, no. 1, pp. 45-52. Published 2014-01-27.
DOI: https://doi.org/10.4236/JILSA.2014.61005
Citations: 134
Abstract
Building a successful Intrusion Detection System (IDS) is a complicated problem
because of its nonlinearity and because network traffic data streams carry many
quantitative and qualitative features. To address this problem, several types of
intrusion detection methods have been proposed, with differing levels of accuracy.
Choosing an effective and robust method for IDS is therefore a very
important topic in information security. In this work, we have built two models
for classification. One is based on Support Vector Machines (SVM)
and the other on Random Forests (RF). Experimental results show that both
classifiers are effective. SVM is slightly more accurate, but more expensive in
terms of time. RF achieves similar accuracy much faster when its
modeling parameters are supplied. These classifiers can contribute to an IDS as one
source of analysis and increase its accuracy. In this paper, the KDD’99 dataset is
used to determine which of the two is the better intrusion detector for this
dataset. Statistical analysis of the KDD’99 dataset has revealed important issues
that strongly affect the performance of evaluated systems and result in a very
poor evaluation of anomaly detection approaches. The most important deficiency
of the KDD’99 dataset is the huge number of redundant records. To solve these
issues, we have developed a new dataset, consisting of KDD99Train+ and KDD99Test+,
which does not include any redundant records in either the train set or the test
set, so the classifiers will not be biased towards more frequent records. The
numbers of records in the train and test sets are now reasonable, which makes it
affordable to run the experiments on the complete set without the need to
randomly select a small portion. The findings of this paper will be very useful
for applying SVM and RF in a more meaningful way, so as to maximize the
performance rate and minimize the false negative rate.
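
The two ideas that close the abstract, removing redundant records and measuring the false negative rate, can be illustrated with a minimal Python sketch. It is not the authors' procedure, which the abstract does not describe. The function names, the toy row layout (duration, protocol, service, label) and the "normal" label are assumptions made for illustration, and the false negative rate is taken here to mean the fraction of true attack records predicted as normal.

```python
from collections import Counter


def drop_redundant_records(records):
    """Return the records with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec)  # compare whole rows, label included
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def false_negative_rate(y_true, y_pred, normal_label="normal"):
    """Fraction of true attack records that were predicted as normal traffic."""
    attacks = [(t, p) for t, p in zip(y_true, y_pred) if t != normal_label]
    if not attacks:
        return 0.0  # no attack records, so nothing could be missed
    missed = sum(1 for _, p in attacks if p == normal_label)
    return missed / len(attacks)


# Toy rows: (duration, protocol, service, label). Values are illustrative only.
train = [
    (0, "tcp", "http", "normal"),
    (0, "tcp", "http", "normal"),
    (0, "icmp", "ecr_i", "smurf"),
    (0, "icmp", "ecr_i", "smurf"),
    (0, "icmp", "ecr_i", "smurf"),
    (2, "tcp", "telnet", "guess_passwd"),
]

deduped = drop_redundant_records(train)
label_counts_before = Counter(row[-1] for row in train)
label_counts_after = Counter(row[-1] for row in deduped)

# Hypothetical predictions: one of the three attack records is missed.
y_true = ["normal", "smurf", "guess_passwd", "smurf"]
y_pred = ["normal", "smurf", "normal", "smurf"]
fnr = false_negative_rate(y_true, y_pred)
```

In this toy data the repeated "smurf" rows dominate the label counts before deduplication and carry the same weight as the other classes afterwards, which is the kind of frequency bias the abstract says the redundant records introduce.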