Wavelet decomposition of software entropy reveals symptoms of malicious code

Journal of Innovation in Digital Ecosystems. Pub Date: 2016-12-01. DOI: 10.1016/j.jides.2016.10.009

Michael Wojnowicz, Glenn Chisholm, Matt Wolff, Xuan Zhao

Volume 3, Issue 2, Pages 130-140

Citations: 28

Abstract

Sophisticated malware authors can sneak hidden malicious content into portable executable files, and this content can be hard to detect, especially if it is encrypted or compressed. However, when an executable file switches between content regimes (e.g., native, encrypted, compressed, text, and padding), there are corresponding shifts in the file’s representation as an entropy signal. In this paper, we develop a method for automatically quantifying the extent to which patterned variations in a file’s entropy signal make it “suspicious”. In Experiment 1, we use wavelet transforms to define a Suspiciously Structured Entropic Change Score (SSECS), a scalar feature that quantifies the suspiciousness of a file based on its distribution of entropic energy across multiple levels of spatial resolution. Based on this single feature, it was possible to raise predictive accuracy on a malware detection task from 50.0% to 68.7%, even though the single feature was applied to a heterogeneous corpus of malware discovered “in the wild”. In Experiment 2, we describe how wavelet-based decompositions of software entropy can be applied to a parasitic malware detection task involving large numbers of samples and features. By extracting only string and entropy features (with wavelet decompositions) from software samples, we are able to obtain almost 99% detection of parasitic malware with fewer than 1% false positives on good files. Moreover, the addition of wavelet-based features uniformly improved detection performance across plausible false positive rates, both in a strings-only model (e.g., from 80.90% to 82.97%) and in a strings-plus-entropy model (e.g., from 92.10% to 94.74%, and from 98.63% to 98.90%). Overall, wavelet decomposition of software entropy can be useful in machine learning models that detect malware by extracting millions of features from executable files.
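The idea of an entropy signal whose energy is spread across resolution levels can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the wavelet family, the window size, or how the SSECS is computed from the energies, so the Haar wavelet, the 256-byte non-overlapping chunks, and the function names `chunk_entropy`, `entropy_signal`, and `haar_energy_spectrum` are all assumptions made for illustration.

```python
import math
from collections import Counter


def chunk_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte chunk, in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    n = len(chunk)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(chunk).values())


def entropy_signal(data: bytes, chunk_size: int = 256) -> list[float]:
    """Entropy of consecutive non-overlapping chunks.

    chunk_size=256 is an assumed value, not taken from the abstract.
    """
    return [chunk_entropy(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def haar_energy_spectrum(signal: list[float]) -> list[float]:
    """Sum of squared Haar detail coefficients at each resolution level.

    The signal is truncated to the largest power-of-two length.
    Index 0 is the finest level; the last index is the coarsest.
    The Haar wavelet is an assumption chosen for simplicity.
    """
    n = 1 << (len(signal).bit_length() - 1) if signal else 0
    approx = signal[:n]
    energies = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail coefficients capture change between neighbours at this scale.
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        # Approximation coefficients carry the smoothed signal to the next level.
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        energies.append(sum(d * d for d in detail))
    return energies


if __name__ == "__main__":
    # Synthetic "file": a zero-padded half followed by a maximally varied half.
    data = bytes(2048) + bytes(range(256)) * 8
    signal = entropy_signal(data)
    print(signal)
    print(haar_energy_spectrum(signal))
```

For this synthetic input the entropy signal is a single step from low to high entropy, so essentially all of the detail energy lands in the coarsest level. A file that alternates rapidly between regimes would instead concentrate energy at finer levels. Per-level energies of this kind could then serve as inputs to a classifier, which is the general role the abstract assigns to wavelet-based features.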

