Static Analysis through Topic Modeling and its Application to Malware Programs Classification

2019 IEEE National Aerospace and Electronics Conference (NAECON). Pub Date: 2019-07-01. DOI: 10.1109/NAECON46414.2019.9057876

Ouboti Djaneye-Boundjou, Temesguen Messay-Kebede, David Kapp, Jeremiah Greer, A. Ralescu


Citations: 3

Abstract

We perform static analysis of malware programs in the BIG 2015 dataset, a repository containing nine different families of malware programs. Our main goal is to provide a framework for classifying the programs in the dataset. Our analysis is static in the sense that the contents of the programs are examined, and their representations constructed, without executing the programs. More precisely, assembly language opcodes are extracted from the programs in the dataset and concatenated to construct documents representing these programs. Treating opcodes as words, we then employ Natural Language Processing tools and techniques to analyze the documents. Mainly, the Latent Dirichlet Allocation (LDA) algorithm is used to model documents as weighted mixtures of a fixed number of topics. A topic is a collection of words grouped together for their ability to capture meaningful attributes of the documents. We note that the weight distribution of topics within documents of the same family (visually) shows a common pattern that seemingly varies from one family to another. This helps justify the use of LDA as a feature extraction method, with the features being the topic weights that represent each document. Subsequently, after training a fine k-nearest neighbors classifier that takes topic weights as inputs, testing results show a 97.2% classification accuracy, attesting to the efficacy of the overall approach.
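The first and last stages of the pipeline described in the abstract (building an opcode document from a disassembly listing, then classifying a document by its topic weights with k-nearest neighbors) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the opcode vocabulary, the listing format, the sample topic weights, the family labels, and the choice of k = 1 with Euclidean distance are all hypothetical. The LDA step itself is omitted; in practice the topic weights would come from an LDA implementation fitted on the opcode documents.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny opcode vocabulary (assumption, not from the paper).
KNOWN_OPCODES = {"mov", "push", "pop", "call", "jmp", "add",
                 "sub", "xor", "cmp", "retn", "lea", "jz"}


def opcode_document(asm_text):
    """Concatenate the opcodes of a disassembly listing into one 'document'.

    The program is never executed: only its text is inspected.
    """
    words = []
    for line in asm_text.splitlines():
        code = line.split(";", 1)[0]          # drop trailing comments
        for token in code.split():            # whitespace-separated fields
            if token.lower() in KNOWN_OPCODES:
                words.append(token.lower())
                break                         # at most one opcode per line
    return " ".join(words)


def knn_predict(train_weights, train_labels, query, k=1):
    """Majority vote among the k training documents nearest to `query`.

    Each document is represented by its vector of topic weights.
    """
    order = sorted(range(len(train_weights)),
                   key=lambda i: math.dist(train_weights[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Illustrative listing in an assumed "address, bytes, instruction" layout.
SAMPLE_ASM = """\
.text:00401000 55          push    ebp
.text:00401001 8B EC       mov     ebp, esp
.text:00401003 E8 10 00    call    sub_401018 ; helper
.text:00401008 5D          pop     ebp
.text:00401009 C3          retn
"""

# Made-up topic weights over three topics; real values would come from LDA.
TRAIN_WEIGHTS = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
                 [0.1, 0.1, 0.8], [0.2, 0.1, 0.7]]
TRAIN_LABELS = ["family_A", "family_A", "family_B", "family_B"]

document = opcode_document(SAMPLE_ASM)
predicted = knn_predict(TRAIN_WEIGHTS, TRAIN_LABELS, [0.75, 0.15, 0.10])
```

Representing each program by a short vector of topic weights keeps the classifier's input dimension equal to the number of topics, regardless of how long the opcode document is.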

