Fauzi Adi Rafrastara, Catur Supriyanto, Cinantya Paramita, Yani Parti Astuti, Foez Ahmed
Performance Improvement of Random Forest Algorithm for Malware Detection on Imbalanced Dataset using Random Under-Sampling Method
Jurnal Informatika Jurnal Pengembangan IT, published 2023-05-30
DOI: 10.30591/jpit.v8i2.5207 (https://doi.org/10.30591/jpit.v8i2.5207)
Abstract
Handling imbalanced datasets poses its own challenges. An inappropriate step during the pre-processing of imbalanced data can degrade prediction results. The accuracy score may appear high while recall and specificity suffer, because the predictions are dominated by the majority class. In malware detection, false negatives are especially critical, since a missed malware sample can be fatal. Prediction errors, particularly false negatives, must therefore be minimized. The first step in handling an imbalanced dataset in this setting is to balance the classes. One popular method for balancing the data, Random Under-Sampling (RUS), is applied here. Random Forest is then used to classify each file as goodware or malware. Three evaluation metrics are used to assess the model: classification accuracy, recall, and specificity. Finally, the performance of Random Forest is compared with three other methods, namely kNN, Naïve Bayes, and Logistic Regression. The results show that Random Forest achieved the best performance among the evaluated methods, scoring 98.1% for accuracy, 98.0% for recall, and 98.2% for specificity.
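The following is a minimal Python sketch of the two ideas the abstract relies on: why accuracy alone can mislead on imbalanced data, and how random under-sampling balances the classes. It is not the authors' implementation. The toy data (95 goodware, 5 malware), the choice of malware as the positive class, and the function names `random_under_sample` and `evaluate` are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Randomly drop samples so every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    keep = []
    for idxs in by_class.values():
        # Sample without replacement; the minority class is kept whole.
        keep.extend(rng.sample(idxs, n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, recall, and specificity from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical toy data: 0 = goodware (majority), 1 = malware (positive class).
y = [0] * 95 + [1] * 5
X = [[i] for i in range(len(y))]

# A model that always predicts the majority class looks accurate
# but misses every malware sample (recall is zero).
baseline = evaluate(y, [0] * len(y))

# After under-sampling, both classes have the same number of samples.
X_bal, y_bal = random_under_sample(X, y, seed=42)
class_counts = Counter(y_bal)
```

Here `baseline` shows high accuracy with zero recall, which is the failure mode the abstract describes, and `class_counts` shows equal class sizes after under-sampling. In practice a classifier such as Random Forest would be trained on the balanced set and evaluated on held-out data.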