Android Malware Detection: Building Useful Representations

2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA). Pub Date: 2016-12-01. DOI: 10.1109/ICMLA.2016.0041

L. Sayfullina, Emil Eirola, Dmitry Komashinsky, Paolo Palumbo, J. Karhunen


Citations: 9

Abstract



Proactively detecting Android malware has proven to be a challenging problem. The challenges stem from a variety of issues, but recent literature has shown that the task is hard to solve with high accuracy when only a restricted, fixed set of features, such as permissions, is used. The opposite approach of including all available features is also problematic, as it causes the feature space to grow beyond a reasonable size. In this paper we focus on finding an efficient way to select a representative feature space while preserving its discriminative power on unseen data. We go beyond traditional approaches like Principal Component Analysis, which is too computationally heavy for large-scale problems with millions of features. In particular, we show that many feature groups that can be extracted from Android application packages, such as features extracted from the manifest file or strings extracted from the Dalvik Executable (DEX), should be filtered and used in classification separately. Our proposed dimensionality reduction scheme is applied to each group separately and consists of raw string preprocessing, feature selection via log-odds, and finally random projections. Although the size of the feature space grows exponentially as a function of the training set's size, our approach decreases it by several orders of magnitude, which in turn makes accurate classification possible in a real-world scenario. After reducing the dimensionality, we use the feature groups in a lightweight ensemble of logistic classifiers. We evaluated the proposed classification scheme on real malware data provided by an antivirus vendor and achieved a state-of-the-art 88.24% true positive rate and a reasonably low 0.04% false positive rate with a significantly compressed feature space on a balanced test set of 10,000 samples.
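The per-group reduction described in the abstract (log-odds feature selection followed by random projection) might be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact log-odds formula, so a smoothed log-odds ratio between malware and clean samples is assumed here, and the random projection is assumed to be a sparse one with entries in {+1, 0, -1}. All function names, the toy permission strings, and the parameters `smoothing`, `k`, and `out_dim` are hypothetical. The string preprocessing step and the logistic classifiers are left out.

```python
import math
import random
from collections import Counter


def log_odds_scores(samples, labels, smoothing=1.0):
    """Smoothed log-odds ratio per feature (assumed definition, not from the paper).

    samples: list of sets of string features for one feature group.
    labels:  1 for malware, 0 for clean.
    """
    n_mal = sum(labels)
    n_clean = len(labels) - n_mal
    mal_counts, clean_counts = Counter(), Counter()
    for feats, y in zip(samples, labels):
        (mal_counts if y == 1 else clean_counts).update(feats)

    scores = {}
    for f in set(mal_counts) | set(clean_counts):
        # Smoothing keeps both probabilities strictly inside (0, 1).
        p_mal = (mal_counts[f] + smoothing) / (n_mal + 2 * smoothing)
        p_clean = (clean_counts[f] + smoothing) / (n_clean + 2 * smoothing)
        odds_mal = p_mal / (1.0 - p_mal)
        odds_clean = p_clean / (1.0 - p_clean)
        scores[f] = math.log(odds_mal / odds_clean)
    return scores


def select_top_k(scores, k):
    """Keep the k features whose log-odds are farthest from zero."""
    ranked = sorted(scores, key=lambda f: (-abs(scores[f]), f))
    return ranked[:k]


def make_projection(features, out_dim, seed=0):
    """Sparse random projection: one random row per kept feature.

    Entries are +1, 0, -1 with probabilities 1/6, 2/3, 1/6, scaled by
    sqrt(3 / out_dim) (an assumed choice of projection matrix).
    """
    rng = random.Random(seed)
    scale = math.sqrt(3.0 / out_dim)
    return {
        f: [scale * rng.choice((1, 0, 0, 0, 0, -1)) for _ in range(out_dim)]
        for f in sorted(features)
    }


def project(sample, projection, out_dim):
    """Project a binary feature set by summing the rows of its kept features."""
    vec = [0.0] * out_dim
    for f in sample:
        row = projection.get(f)
        if row is None:
            continue  # feature was filtered out during selection
        for i, value in enumerate(row):
            vec[i] += value
    return vec


# Toy "permissions" group with made-up samples: two malware, two clean.
toy_samples = [
    {"SEND_SMS", "READ_CONTACTS", "INTERNET"},
    {"SEND_SMS", "INTERNET"},
    {"INTERNET", "CAMERA"},
    {"CAMERA", "INTERNET"},
]
toy_labels = [1, 1, 0, 0]

toy_scores = log_odds_scores(toy_samples, toy_labels)
toy_kept = select_top_k(toy_scores, k=2)  # INTERNET scores 0 and is dropped
toy_projection = make_projection(toy_kept, out_dim=8)
toy_vector = project(toy_samples[0], toy_projection, out_dim=8)
```

In this sketch each feature group would get its own selection and projection, and the resulting low-dimensional vectors would then feed one logistic classifier per group. How the paper combines the per-group classifiers into an ensemble is not stated in the abstract, so that step is not shown.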
