RmvDroid: Towards A Reliable Android Malware Dataset with App Metadata

Haoyu Wang, Junjun Si, Hao Li, Yao Guo
{"title":"RmvDroid: Towards A Reliable Android Malware Dataset with App Metadata","authors":"Haoyu Wang, Junjun Si, Hao Li, Yao Guo","doi":"10.1109/MSR.2019.00067","DOIUrl":null,"url":null,"abstract":"A large number of research studies have been focused on detecting Android malware in recent years. As a result, a reliable and large-scale malware dataset is essential to build effective malware classifiers and evaluate the performance of different detection techniques. Although several Android malware benchmarks have been widely used in our research community, these benchmarks face several major limitations. First, most of the existing datasets are outdated and cannot reflect current malware evolution trends. Second, most of them only rely on VirusTotal to label the ground truth of malware, while some anti-virus engines on VirusTotal may not always report reliable results. Third, all of them only contain the apps themselves (apks), while other important app information (e.g., app description, user rating, and app installs) is missing, which greatly limits the usage scenarios of these datasets. In this paper, we have created a reliable Android malware dataset based on Google Play's app maintenance results over several years. We first created four snapshots of Google Play in 2014, 2015, 2017 and 2018 respectively. Then we use VirusTotal to label apps with possible sensitive behaviors, and monitor these apps on Google Play to see whether Google has removed them or not. Based on this approach, we have created a malware dataset containing 9,133 samples that belong to 56 malware families with high confidence. We believe this dataset will boost a series of research studies including Android malware detection and classification, mining apps for anomalies, and app store mining, etc.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"6 1","pages":"404-408"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"59","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSR.2019.00067","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 59

Abstract

A large number of research studies have been focused on detecting Android malware in recent years. As a result, a reliable and large-scale malware dataset is essential to build effective malware classifiers and evaluate the performance of different detection techniques. Although several Android malware benchmarks have been widely used in our research community, these benchmarks face several major limitations. First, most of the existing datasets are outdated and cannot reflect current malware evolution trends. Second, most of them only rely on VirusTotal to label the ground truth of malware, while some anti-virus engines on VirusTotal may not always report reliable results. Third, all of them only contain the apps themselves (apks), while other important app information (e.g., app description, user rating, and app installs) is missing, which greatly limits the usage scenarios of these datasets. In this paper, we have created a reliable Android malware dataset based on Google Play's app maintenance results over several years. We first created four snapshots of Google Play in 2014, 2015, 2017 and 2018 respectively. Then we use VirusTotal to label apps with possible sensitive behaviors, and monitor these apps on Google Play to see whether Google has removed them or not. Based on this approach, we have created a malware dataset containing 9,133 samples that belong to 56 malware families with high confidence. We believe this dataset will boost a series of research studies including Android malware detection and classification, mining apps for anomalies, and app store mining, etc.
RmvDroid:迈向一个可靠的Android恶意软件数据集与应用元数据
近年来,大量的研究集中在检测Android恶意软件上。因此,一个可靠和大规模的恶意软件数据集对于构建有效的恶意软件分类器和评估不同检测技术的性能至关重要。尽管在我们的研究社区中已经广泛使用了几个Android恶意软件基准测试,但这些基准测试面临几个主要的限制。首先,大多数现有数据集已经过时,无法反映当前恶意软件的发展趋势。其次,他们中的大多数只依赖VirusTotal来标记恶意软件的基本真相,而VirusTotal上的一些反病毒引擎可能并不总是报告可靠的结果。第三,它们都只包含应用本身(apk),而其他重要的应用信息(如应用描述、用户评分和应用安装)则缺失,这极大地限制了这些数据集的使用场景。在本文中,我们基于Google Play多年来的应用维护结果创建了一个可靠的Android恶意软件数据集。我们首先分别在2014年、2015年、2017年和2018年创建了4个Google Play快照。然后,我们使用VirusTotal来标记可能存在敏感行为的应用,并在Google Play上监控这些应用,看看谷歌是否已将其删除。基于这种方法,我们创建了一个包含9133个样本的恶意软件数据集,这些样本属于56个恶意软件家族,具有高置信度。我们相信该数据集将推动一系列研究,包括Android恶意软件检测和分类,挖掘异常应用程序和应用商店挖掘等。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信