The Effect on Network Flows-Based Features and Training Set Size on Malware Detection

J. Jiménez, K. Goseva-Popstojanova
Published in: 2018 IEEE 17th International Symposium on Network Computing and Applications (NCA)
Publication date: 2018-11-01
DOI: 10.1109/NCA.2018.8548325 (https://doi.org/10.1109/NCA.2018.8548325)
Citations: 10

Abstract

The Effect on Network Flows-Based Features and Training Set Size on Malware Detection
Although network flows have been used in areas such as network traffic analysis and botnet detection, few works have used network flows-based features for malware detection. This paper focuses on malware detection using features extracted from network traffic and system logs. We evaluated the performance of four supervised machine learning algorithms (J48, Random Forest, Naive Bayes, and PART) for malware detection and identified the best learner. Furthermore, we used feature selection based on information gain to identify the smallest number of features needed for classification. In addition, we experimented with training sets of different sizes. The main findings include: (1) Adding network flows-based features significantly improved the performance of malware detection. (2) J48 and PART were the best performing learners, with the highest F-score and G-score values. (3) With J48, the top five features ranked by information gain attained the same performance as all 88 features. With PART, the top fourteen features ranked by information gain attained the same performance as all 88 features. None of the system logs-based features were included in these two models. (4) The classification performance when training on 75% of the data was comparable to that when training on 90% of the data. As little as 25% of the data can be used for training at the expense of somewhat higher, but not very significant, performance degradation (i.e., less than 7% for F-score and 6% for G-score compared to training on 90% of the data).
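To make the feature-ranking and scoring steps mentioned in the abstract concrete, the following is a minimal Python sketch, not the authors' pipeline. It ranks discrete features by information gain and computes an F-score and a G-score from confusion-matrix counts. The feature names and toy data are hypothetical, and the G-score formula used here (harmonic mean of recall and 1 minus the false positive rate) is an assumption, since the abstract does not define it.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """H(labels) - H(labels | feature) for one discrete feature column."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return feature names sorted by decreasing information gain."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def g_score(tp, fp, tn, fn):
    """Assumed definition: harmonic mean of recall and (1 - false positive rate)."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)  # equals 1 - false positive rate
    return 2 * recall * specificity / (recall + specificity)


# Hypothetical toy data: label 1 = malware, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "flow_duration_bin": ["long", "long", "long", "short", "short", "short"],
    "log_event_bin": ["a", "b", "a", "b", "a", "b"],
}

ranking = rank_features(columns, labels)
top_k = ranking[:1]  # keep only the k highest-ranked features
print(top_k)
```

In this toy example the flow-based column separates the classes perfectly and therefore ranks first, which mirrors the kind of top-k selection the abstract describes. Continuous features would need to be discretized before this calculation applies.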