Malicious PDF detection using metadata and structural features

Annual Computer Security Applications Conference (ACSAC) · Pub Date: 2012-12-03 · DOI: 10.1145/2420950.2420987

Charles Smutz, A. Stavrou


Citations: 262

Abstract

Owing to their versatile functionality and widespread adoption, PDF documents have become a popular avenue for user exploitation, ranging from large-scale phishing attacks to targeted attacks. In this paper, we present a framework for robust detection of malicious documents through machine learning. Our approach is based on features extracted from document metadata and structure. Using real-world datasets, we demonstrate the adequacy of these document properties for malware detection and the durability of these features across new malware variants. Our analysis shows that the Random Forests classification method, an ensemble classifier that randomly selects features for each individual classification tree, yields the best detection rates, even on previously unseen malware. Indeed, using multiple datasets containing an aggregate of over 5,000 unique malicious documents and over 100,000 benign ones, our classification rates remain well above 99% while maintaining low false positive rates of 0.2% or less for different classification parameters and experimental scenarios. Moreover, the classifier can detect documents crafted for targeted attacks and separate them from broadly distributed malicious PDF documents. Remarkably, we also discovered that by artificially reducing the influence of the top features in the classifier, we can still achieve a high rate of detection in an adversarial setting where the attacker is aware of both the top features utilized in the classifier and our normality model. Thus, the classifier is resilient against mimicry attacks even when the attacker knows the document features, classification method, and training set.
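To make the idea of "features extracted from document metadata and structure" concrete, the following is a minimal sketch, not the authors' feature set or extractor. The keyword list, the feature names, and the `extract_features` helper are all assumptions chosen for illustration; the abstract does not enumerate the features actually used. The sketch counts a few structural markers in raw PDF bytes and measures the length of one metadata field.

```python
import re

# Hypothetical keyword list for illustration only; the paper's actual
# features are not listed in the abstract.
STRUCTURAL_KEYWORDS = [
    b"/Page", b"/JavaScript", b"/JS", b"/OpenAction", b"/Font", b"/EmbeddedFile",
]


def extract_features(pdf_bytes: bytes) -> dict:
    """Return a flat dict of simple structural and metadata counts."""
    features = {"size_bytes": len(pdf_bytes)}
    # Indirect object headers look like "12 0 obj".
    features["count_obj"] = len(re.findall(rb"\b\d+\s+\d+\s+obj\b", pdf_bytes))
    features["count_stream"] = len(re.findall(rb"\bendstream\b", pdf_bytes))
    for kw in STRUCTURAL_KEYWORDS:
        # Match the exact name, so "/Page" does not also count "/Pages".
        pattern = re.escape(kw) + rb"(?![A-Za-z0-9])"
        name = "count_" + kw[1:].decode("ascii")
        features[name] = len(re.findall(pattern, pdf_bytes))
    # Metadata example: length of the /Producer string, 0 if absent.
    match = re.search(rb"/Producer\s*\(([^)]*)\)", pdf_bytes)
    features["producer_len"] = len(match.group(1)) if match else 0
    return features


# A tiny hand-written PDF-like byte string (not a valid, renderable PDF).
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R /OpenAction 4 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
    b"4 0 obj\n<< /S /JavaScript /JS (app.alert(1);) >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R /Info << /Producer (ExampleWriter) >> >>\n"
)

features = extract_features(sample)
```

Feature vectors of this kind could then be passed to any Random Forest implementation (for example, scikit-learn's `RandomForestClassifier`) for training on labeled benign and malicious documents. A pattern-matching extractor like this one does not decode compressed object streams, so it is only a rough approximation of a real structural parser.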
