{"title":"基于情境化向量嵌入的恶意软件检测","authors":"Vinay Pandya, Fabio Di Troia","doi":"10.1109/SVCC56964.2023.10165170","DOIUrl":null,"url":null,"abstract":"Detecting malware is an integral part of system security. In recent years, machine learning models have been applied with success to overcome this challenging problem. The aim of this research is to apply context-dependent word embeddings to classify malware. We extract opcodes from the malware samples and use them to generate the embeddings that train the classifiers. Transformers are a novel architecture that utilizes self-attention to handle long-range dependencies. Different transformer architectures, namely, BERT, DistilBERT, AIBERT, and RoBERTa, are implemented in this work to generate context-dependent word embeddings. Apart from using transformer models, we also experimented with ELMo, a bidirectional language model which can generate contextualized opcode embeddings. These embeddings are used to train our machine learning models in classifying samples from different malware families. We compared our contextualized results with context-free embeddings generated by Word2Vec, and HMM2Vec algorithms. The classification algorithms trained on our embeddings consist of Resnet-18 CNN, Random Forest, Support Vector Machines (SVMs), and k-Nearest Neighbours (k-NNs).","PeriodicalId":243155,"journal":{"name":"2023 Silicon Valley Cybersecurity Conference (SVCC)","volume":"312 ","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Malware Detection through Contextualized Vector Embeddings\",\"authors\":\"Vinay Pandya, Fabio Di Troia\",\"doi\":\"10.1109/SVCC56964.2023.10165170\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Detecting malware is an integral part of system security. 
In recent years, machine learning models have been applied with success to overcome this challenging problem. The aim of this research is to apply context-dependent word embeddings to classify malware. We extract opcodes from the malware samples and use them to generate the embeddings that train the classifiers. Transformers are a novel architecture that utilizes self-attention to handle long-range dependencies. Different transformer architectures, namely, BERT, DistilBERT, AIBERT, and RoBERTa, are implemented in this work to generate context-dependent word embeddings. Apart from using transformer models, we also experimented with ELMo, a bidirectional language model which can generate contextualized opcode embeddings. These embeddings are used to train our machine learning models in classifying samples from different malware families. We compared our contextualized results with context-free embeddings generated by Word2Vec, and HMM2Vec algorithms. The classification algorithms trained on our embeddings consist of Resnet-18 CNN, Random Forest, Support Vector Machines (SVMs), and k-Nearest Neighbours (k-NNs).\",\"PeriodicalId\":243155,\"journal\":{\"name\":\"2023 Silicon Valley Cybersecurity Conference (SVCC)\",\"volume\":\"312 \",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 Silicon Valley Cybersecurity Conference (SVCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SVCC56964.2023.10165170\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 Silicon Valley 
Cybersecurity Conference (SVCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SVCC56964.2023.10165170","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Malware Detection through Contextualized Vector Embeddings
Detecting malware is an integral part of system security. In recent years, machine learning models have been applied with success to this challenging problem. The aim of this research is to apply context-dependent word embeddings to classify malware. We extract opcodes from the malware samples and use them to generate the embeddings that train the classifiers. Transformers are a novel architecture that uses self-attention to handle long-range dependencies. Several transformer architectures, namely BERT, DistilBERT, ALBERT, and RoBERTa, are implemented in this work to generate context-dependent word embeddings. Apart from the transformer models, we also experiment with ELMo, a bidirectional language model that can generate contextualized opcode embeddings. These embeddings are used to train our machine learning models to classify samples from different malware families. We compare our contextualized results with the context-free embeddings generated by the Word2Vec and HMM2Vec algorithms. The classification algorithms trained on our embeddings consist of a ResNet-18 CNN, Random Forest, Support Vector Machines (SVMs), and k-Nearest Neighbours (k-NNs).
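The pipeline described in the abstract can be sketched end to end: opcode sequences are turned into fixed-length vectors, which then train conventional classifiers. The sketch below is illustrative only; it substitutes hashed opcode n-gram counts for the learned contextual embeddings (which would come from a transformer such as BERT), and the opcode traces and family labels are invented for the example.

```python
# Illustrative sketch of the opcode-classification pipeline.
# Stand-in featurizer: hashed opcode n-gram counts instead of
# transformer (BERT/ELMo) embeddings, so the example is self-contained.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical opcode traces for two toy malware "families".
family_a = ["mov push call mov ret", "push mov call pop ret", "mov mov call ret"]
family_b = ["xor jmp cmp jne xor", "cmp xor jmp jne nop", "xor cmp jmp jne ret"]
X_text = family_a + family_b
y = [0] * len(family_a) + [1] * len(family_b)

# Opcode sequence -> fixed-length vector (stand-in for a learned embedding).
vectorizer = HashingVectorizer(n_features=256, ngram_range=(1, 2),
                               alternate_sign=False)
X = vectorizer.fit_transform(X_text)

# Two of the classifiers named in the paper, trained on the vectors.
svm = SVC(kernel="linear").fit(X, y)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Classify an unseen trace that resembles family A.
print(svm.predict(vectorizer.transform(["mov push call ret"])))
```

Swapping the `HashingVectorizer` for per-sequence transformer embeddings (e.g., mean-pooled BERT hidden states over the opcode tokens) reproduces the contextualized variant of the pipeline, while keeping the downstream classifiers unchanged.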