The Effect on Network Flows-Based Features and Training Set Size on Malware Detection

2018 IEEE 17th International Symposium on Network Computing and Applications (NCA). Pub Date: 2018-11-01. DOI: 10.1109/NCA.2018.8548325

J. Jiménez, K. Goseva-Popstojanova


Citations: 10

Abstract


The Effect on Network Flows-Based Features and Training Set Size on Malware Detection

Although network flows have been used in areas such as network traffic analysis and botnet detection, few works have used network flows-based features for malware detection. This paper focuses on malware detection using features extracted from network traffic and system logs. We evaluated the performance of four supervised machine learning algorithms (i.e., J48, Random Forest, Naive Bayes, and PART) for malware detection and identified the best learner. Furthermore, we used feature selection based on information gain to identify the smallest number of features needed for classification. In addition, we experimented with training sets of different sizes. The main findings include: (1) Adding network flows-based features significantly improved the performance of malware detection. (2) J48 and PART were the best performing learners, with the highest F-score and G-score values. (3) With J48, the top five features ranked by information gain attained the same performance as all 88 features. With PART, the top fourteen features ranked by information gain led to the same performance as all 88 features. None of the system logs-based features were included in these two models. (4) The classification performance when training on 75% of the data was comparable to training on 90% of the data. As little as 25% of the data can be used for training at the expense of a somewhat higher, but not very significant, performance degradation (i.e., less than 7% for F-score and 6% for G-score compared to training on 90% of the data).
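The abstract describes ranking features by information gain and keeping only the top few. The following is a minimal sketch of that general technique, not the authors' implementation. It assumes features are already discretized, and the feature names, values, and labels are invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature) for one discrete feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels):
    """Return (feature_name, gain) pairs sorted by decreasing gain."""
    names = rows[0].keys()
    gains = {n: information_gain([r[n] for r in rows], labels) for n in names}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: names and values are assumptions, not from the paper.
rows = [
    {"flow_duration": "long", "dst_port": "high", "log_events": "few"},
    {"flow_duration": "long", "dst_port": "low", "log_events": "many"},
    {"flow_duration": "short", "dst_port": "high", "log_events": "few"},
    {"flow_duration": "short", "dst_port": "low", "log_events": "many"},
]
labels = ["malware", "malware", "benign", "benign"]

ranking = rank_features(rows, labels)
# Keep only the k highest-ranked features (k = 1 here) for a reduced model.
top_k = [name for name, _ in ranking[:1]]
```

In this toy data only `flow_duration` separates the two classes, so it is ranked first and the other two features carry no information. A reduced model would then be trained on the `top_k` columns and compared against the model trained on all features.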

