Performance Improvement of Random Forest Algorithm for Malware Detection on Imbalanced Dataset using Random Under-Sampling Method

Jurnal Informatika Jurnal Pengembangan IT | Pub Date: 2023-05-30 | DOI: 10.30591/jpit.v8i2.5207

Fauzi Adi Rafrastara, Catur Supriyanto, Cinantya Paramita, Yani Parti Astuti, Foez Ahmed


Citations: 0


Performance Improvement of Random Forest Algorithm for Malware Detection on Imbalanced Dataset using Random Under-Sampling Method

Handling imbalanced datasets poses its own challenges. An inappropriate step during the pre-processing of imbalanced data can negatively affect the prediction results. The accuracy score may look high, yet recall and specificity can suffer badly, because the predictions are dominated by the majority class. In malware detection, false negatives are especially critical, since they can be fatal. Prediction errors, particularly false negatives, must therefore be minimized. The first step in handling an imbalanced dataset under these conditions is to balance the classes. One popular method for balancing the data is Random Under-Sampling (RUS). Random Forest is then used to classify each file as either goodware or malware. Next, three evaluation metrics are used to assess the model: classification accuracy, recall, and specificity. Lastly, the performance of Random Forest is compared with three other methods, namely kNN, Naïve Bayes, and Logistic Regression. The results show that Random Forest achieved the best performance among the evaluated methods, scoring 98.1% for accuracy, 98.0% for recall, and 98.2% for specificity.
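The two ideas in the abstract, balancing the classes by random under-sampling and judging a model by recall and specificity as well as accuracy, can be illustrated with a minimal standard-library Python sketch. It is not the authors' implementation: the function names `random_under_sample` and `binary_metrics`, the toy 90/10 dataset, and the choice of malware (label `1`) as the positive class are all assumptions made for illustration, and no Random Forest is trained here.

```python
import random
from collections import Counter


def random_under_sample(samples, labels, seed=0):
    """Randomly discard samples until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(items) for items in by_class.values())
    balanced = []
    for y, items in by_class.items():
        # rng.sample draws `target` items without replacement
        balanced.extend((x, y) for x in rng.sample(items, target))
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [y for _, y in balanced]


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, recall and specificity from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "recall": _ratio(tp, tp + fn),        # share of malware that was caught
        "specificity": _ratio(tn, tn + fp),   # share of goodware left alone
    }


# Toy imbalanced dataset (assumed): 90 goodware files (0), 10 malware files (1).
samples = list(range(100))
labels = [0] * 90 + [1] * 10

# A degenerate model that always predicts the majority class looks accurate
# but misses every malware sample.
always_goodware = [0] * len(labels)
naive_scores = binary_metrics(labels, always_goodware)

# After under-sampling, both classes have the same number of samples.
balanced_samples, balanced_labels = random_under_sample(samples, labels, seed=42)
balanced_counts = Counter(balanced_labels)
```

On this toy data the always-goodware model reaches 0.9 accuracy with 0.0 recall, which is the failure mode the abstract describes. A classifier such as scikit-learn's `RandomForestClassifier` could then be trained on the balanced data; under-sampling is usually applied only to the training split, so that the test set keeps the original class distribution.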

