Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Malware Analysis

Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security. Pub Date: 2020-11-01. DOI: 10.1145/3411508.3421373

Michael R. Smith, Nicholas T. Johnson, J. Ingram, A. Carbajal, Bridget I. Haus, Eva Domschot, Ramyaa, Christopher C. Lamb, Stephen J Verzi, W. Kegelmeyer


Citations: 22

Abstract

Machine learning (ML) techniques are being used to detect a growing volume of malware and malware variants. Despite these successful applications, we hypothesize that the full potential of ML is not realized in malware analysis (MA) because of a semantic gap between the ML and MA communities, a gap reflected in the data that is used. Due in part to the available data, ML has focused primarily on detection, whereas MA is also interested in identifying behaviors. We review existing open-source malware datasets used in ML and find a lack of behavioral information that could help ML have a stronger impact in MA. As a first step toward bridging this gap, we label existing data with behavioral information using open-source MA reports, which (1) shifts the analysis from identifying malware to identifying behaviors, (2) aligns ML better with MA, and (3) allows ML models to generalize to novel malware in a zero-/few-shot learning manner. Using transfer learning from a state-of-the-art model for malware family classification, we classify the behavior of a malware family not seen during training and achieve 57% to 84% accuracy on behavioral identification, but fail to outperform the baseline set by a majority-class predictor. This highlights opportunities for improvement on this task related to the data representation, the need for malware-specific ML techniques, and a larger training set of malware samples labeled with behaviors.
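The comparison against a majority-class predictor can be illustrated with a minimal sketch. The labels and predictions below are hypothetical and are not taken from the paper; they only show how a model can reach an accuracy in the reported range and still fall short of a baseline that always predicts the most frequent class when the behavior labels are imbalanced.

```python
from collections import Counter


def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return hits / len(test_labels)


def accuracy(predictions, test_labels):
    """Fraction of predictions that match the true labels."""
    hits = sum(1 for p, y in zip(predictions, test_labels) if p == y)
    return hits / len(test_labels)


# Hypothetical binary labels for one behavior
# (1 = sample exhibits the behavior, 0 = it does not).
train_labels = [0] * 80 + [1] * 20
test_labels = [0] * 85 + [1] * 15

# Hypothetical model output: 70 of the 85 negatives and
# 10 of the 15 positives are classified correctly.
model_predictions = [0] * 70 + [1] * 15 + [1] * 10 + [0] * 5

baseline = majority_class_accuracy(train_labels, test_labels)
model = accuracy(model_predictions, test_labels)

print(f"majority-class baseline accuracy: {baseline:.2f}")
print(f"model accuracy: {model:.2f}")
print("model beats baseline:", model > baseline)
```

In this made-up setting the model finds two thirds of the positive samples, which the baseline never does, yet its overall accuracy is lower. On imbalanced labels, raw accuracy therefore has to be read alongside the majority-class baseline, as the abstract does.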
