Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Malware Analysis

Michael R. Smith, Nicholas T. Johnson, J. Ingram, A. Carbajal, Bridget I. Haus, Eva Domschot, Ramyaa, Christopher C. Lamb, Stephen J Verzi, W. Kegelmeyer
{"title":"Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Malware Analysis","authors":"Michael R. Smith, Nicholas T. Johnson, J. Ingram, A. Carbajal, Bridget I. Haus, Eva Domschot, Ramyaa, Christopher C. Lamb, Stephen J Verzi, W. Kegelmeyer","doi":"10.1145/3411508.3421373","DOIUrl":null,"url":null,"abstract":"Machine learning (ML) techniques are being used to detect increasing amounts of malware and variants. Despite successful applications of ML, we hypothesize that the full potential of ML is not realized in malware analysis (MA) due to a semantic gap between the ML and MA communities---as demonstrated in the data that is used. Due in part to the available data, ML has primarily focused on detection whereas MA is also interested in identifying behaviors. We review existing open-source malware datasets used in ML and find a lack of behavioral information that could facilitate stronger impact by ML in MA. As a first step in bridging this gap, we label existing data with behavioral information using open-source MA reports---1) altering the analysis from identifying malware to identifying behaviors, 2)~aligning ML better with MA, and 3)~allowing ML models to generalize to novel malware in a zero/few-shot learning manner. We classify the behavior of a malware family not seen during training using transfer learning from a state-of-the-art model for malware family classification and achieve 57% - 84% accuracy on behavioral identification but fail to outperform the baseline set by a majority class predictor. This highlights opportunities for improvement on this task related to the data representation, the need for malware specific ML techniques, and a larger training set of malware samples labeled with behaviors.","PeriodicalId":132987,"journal":{"name":"Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"22","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3411508.3421373","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 22

Abstract

Machine learning (ML) techniques are being used to detect increasing amounts of malware and variants. Despite successful applications of ML, we hypothesize that the full potential of ML is not realized in malware analysis (MA) due to a semantic gap between the ML and MA communities---as demonstrated in the data that is used. Due in part to the available data, ML has primarily focused on detection whereas MA is also interested in identifying behaviors. We review existing open-source malware datasets used in ML and find a lack of behavioral information that could facilitate stronger impact by ML in MA. As a first step in bridging this gap, we label existing data with behavioral information using open-source MA reports---1) altering the analysis from identifying malware to identifying behaviors, 2)~aligning ML better with MA, and 3)~allowing ML models to generalize to novel malware in a zero/few-shot learning manner. We classify the behavior of a malware family not seen during training using transfer learning from a state-of-the-art model for malware family classification and achieve 57% - 84% accuracy on behavioral identification but fail to outperform the baseline set by a majority class predictor. This highlights opportunities for improvement on this task related to the data representation, the need for malware specific ML techniques, and a larger training set of malware samples labeled with behaviors.
注意差距:关于弥合机器学习和恶意软件分析之间的语义差距
机器学习(ML)技术正被用于检测越来越多的恶意软件和变体。尽管ML的应用取得了成功,但我们假设,由于ML和MA社区之间的语义差距,ML的全部潜力并未在恶意软件分析(MA)中实现——正如所使用的数据所示。部分由于可用的数据,ML主要关注于检测,而MA也对识别行为感兴趣。我们回顾了ML中使用的现有开源恶意软件数据集,发现缺乏可以促进ML在MA中产生更大影响的行为信息。作为弥合这一差距的第一步,我们使用开源MA报告将现有数据标记为行为信息——1)将分析从识别恶意软件更改为识别行为,2)使ML更好地与MA保持一致,3)允许ML模型以零/几次学习的方式推广到新的恶意软件。我们使用最先进的恶意软件家族分类模型中的迁移学习对训练期间未见的恶意软件家族的行为进行分类,并在行为识别上达到57% - 84%的准确率,但未能优于多数类预测器设置的基线。这突出了与数据表示相关的任务的改进机会,对特定于恶意软件的ML技术的需求,以及带有行为标记的更大的恶意软件样本训练集。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信