{"title":"基于情境化向量嵌入的恶意软件检测","authors":"Vinay Pandya, Fabio Di Troia","doi":"10.1109/SVCC56964.2023.10165170","DOIUrl":null,"url":null,"abstract":"Detecting malware is an integral part of system security. In recent years, machine learning models have been applied with success to overcome this challenging problem. The aim of this research is to apply context-dependent word embeddings to classify malware. We extract opcodes from the malware samples and use them to generate the embeddings that train the classifiers. Transformers are a novel architecture that utilizes self-attention to handle long-range dependencies. Different transformer architectures, namely, BERT, DistilBERT, AIBERT, and RoBERTa, are implemented in this work to generate context-dependent word embeddings. Apart from using transformer models, we also experimented with ELMo, a bidirectional language model which can generate contextualized opcode embeddings. These embeddings are used to train our machine learning models in classifying samples from different malware families. We compared our contextualized results with context-free embeddings generated by Word2Vec, and HMM2Vec algorithms. The classification algorithms trained on our embeddings consist of Resnet-18 CNN, Random Forest, Support Vector Machines (SVMs), and k-Nearest Neighbours (k-NNs).","PeriodicalId":243155,"journal":{"name":"2023 Silicon Valley Cybersecurity Conference (SVCC)","volume":"312 ","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Malware Detection through Contextualized Vector Embeddings\",\"authors\":\"Vinay Pandya, Fabio Di Troia\",\"doi\":\"10.1109/SVCC56964.2023.10165170\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Detecting malware is an integral part of system security. 
In recent years, machine learning models have been applied with success to overcome this challenging problem. The aim of this research is to apply context-dependent word embeddings to classify malware. We extract opcodes from the malware samples and use them to generate the embeddings that train the classifiers. Transformers are a novel architecture that utilizes self-attention to handle long-range dependencies. Different transformer architectures, namely, BERT, DistilBERT, AIBERT, and RoBERTa, are implemented in this work to generate context-dependent word embeddings. Apart from using transformer models, we also experimented with ELMo, a bidirectional language model which can generate contextualized opcode embeddings. These embeddings are used to train our machine learning models in classifying samples from different malware families. We compared our contextualized results with context-free embeddings generated by Word2Vec, and HMM2Vec algorithms. The classification algorithms trained on our embeddings consist of Resnet-18 CNN, Random Forest, Support Vector Machines (SVMs), and k-Nearest Neighbours (k-NNs).\",\"PeriodicalId\":243155,\"journal\":{\"name\":\"2023 Silicon Valley Cybersecurity Conference (SVCC)\",\"volume\":\"312 \",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 Silicon Valley Cybersecurity Conference (SVCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SVCC56964.2023.10165170\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 Silicon Valley 
Cybersecurity Conference (SVCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SVCC56964.2023.10165170","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Malware Detection through Contextualized Vector Embeddings
Detecting malware is an integral part of system security. In recent years, machine learning models have been applied with success to this challenging problem. The aim of this research is to apply context-dependent word embeddings to classify malware. We extract opcodes from the malware samples and use them to generate the embeddings that train the classifiers. Transformers are a novel architecture that uses self-attention to handle long-range dependencies. Several transformer architectures, namely BERT, DistilBERT, ALBERT, and RoBERTa, are implemented in this work to generate context-dependent word embeddings. Apart from the transformer models, we also experiment with ELMo, a bidirectional language model that can generate contextualized opcode embeddings. These embeddings are used to train our machine learning models to classify samples from different malware families. We compare our contextualized results with the context-free embeddings generated by the Word2Vec and HMM2Vec algorithms. The classification algorithms trained on our embeddings consist of a ResNet-18 CNN, Random Forest, Support Vector Machines (SVMs), and k-Nearest Neighbours (k-NNs).
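The pipeline described in the abstract can be sketched end to end: opcode sequences are turned into fixed-length vectors, which then train conventional classifiers. The sketch below is illustrative only; it substitutes hashed opcode n-gram counts for the learned contextual embeddings (which would come from a transformer such as BERT), and the opcode traces and family labels are invented for the example.

```python
# Illustrative sketch of the opcode-classification pipeline.
# Stand-in featurizer: hashed opcode n-gram counts instead of
# transformer (BERT/ELMo) embeddings, so the example is self-contained.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical opcode traces for two toy malware "families".
family_a = ["mov push call mov ret", "push mov call pop ret", "mov mov call ret"]
family_b = ["xor jmp cmp jne xor", "cmp xor jmp jne nop", "xor cmp jmp jne ret"]
X_text = family_a + family_b
y = [0] * len(family_a) + [1] * len(family_b)

# Opcode sequence -> fixed-length vector (stand-in for a learned embedding).
vectorizer = HashingVectorizer(n_features=256, ngram_range=(1, 2),
                               alternate_sign=False)
X = vectorizer.fit_transform(X_text)

# Two of the classifiers named in the paper, trained on the vectors.
svm = SVC(kernel="linear").fit(X, y)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Classify an unseen trace that resembles family A.
print(svm.predict(vectorizer.transform(["mov push call ret"])))
```

Swapping the `HashingVectorizer` for per-sequence transformer embeddings (e.g., mean-pooled BERT hidden states over the opcode tokens) reproduces the contextualized variant of the pipeline, while keeping the downstream classifiers unchanged.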