An Empirical Evaluation of Automated Machine Learning Techniques for Malware Detection

P. P. Kundu, Lux Anatharaman, Tram Truong-Huu
Published in: Proceedings of the 2021 ACM Workshop on Security and Privacy Analytics
Publication date: 2021-04-28
DOI: https://doi.org/10.1145/3445970.3451155
Citations: 8

Abstract

It is increasingly difficult, even for machine learning experts, to incorporate all recent best practices into their modeling, because state-of-the-art machine learning techniques develop so quickly. For applications that handle large data sets, choosing the best-performing model together with its best hyper-parameter setting becomes even harder. In this work, we present an empirical evaluation of automated machine learning (AutoML) frameworks, which aim to optimize the hyper-parameters of machine learning models to achieve the best attainable performance. We apply AutoML techniques to the malware detection problem, which requires a true positive rate as high as possible at a false positive rate as low as possible. We adopt two AutoML frameworks, AutoGluon-Tabular and Microsoft Neural Network Intelligence (NNI), to optimize the hyper-parameters of a Light Gradient Boosted Machine (LightGBM) model for classifying malware samples. We carry out extensive experiments on two data sets. The first is the publicly available EMBER data set, which has been used as a benchmark in many malware detection works. The second is a private data set, acquired from a security company, that contains recently collected malware samples. We provide an empirical analysis and a performance comparison of the two AutoML frameworks. The experimental results show that the AutoML frameworks can identify hyper-parameter settings that significantly outperform the best previously known setting: at a false positive rate of 0.1%, the true positive rate of the LightGBM classifier improves from 86.8% to 90% on the EMBER data set and from 80.8% to 87.4% on the private data set.
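The headline metric in the abstract is the true positive rate measured at a fixed false positive rate of 0.1%. The following is a minimal sketch of how such a metric can be computed from classifier scores; it is not the authors' evaluation code. The function name `tpr_at_fpr`, its arguments, and the convention that a sample is flagged when its score is at or above the threshold are all assumptions made for illustration.

```python
from typing import Sequence


def tpr_at_fpr(labels: Sequence[int], scores: Sequence[float], max_fpr: float) -> float:
    """Return the highest true positive rate achievable with FPR <= max_fpr.

    labels: 1 for malware, 0 for benign (assumed encoding).
    scores: higher means more likely malware (assumed convention).
    A sample is flagged when its score is >= the chosen threshold.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one malware and one benign sample")

    # Sweep candidate thresholds from the highest score downward.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best_tpr = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Tied scores fall on the same side of any threshold, so consume
        # the whole tie group before evaluating the rates.
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # FPR can only grow as the threshold drops further
        best_tpr = tp / n_pos
    return best_tpr


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the paper.
    toy_labels = [1, 1, 0, 1, 0, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
    print(tpr_at_fpr(toy_labels, toy_scores, max_fpr=0.0))
```

With a budget of 0.1% (`max_fpr=0.001`), at least 1,000 benign samples are needed before even a single false positive fits within the budget, which is one reason this metric is usually reported on large benchmark sets.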