MalWhiteout: Reducing Label Errors in Android Malware Detection

Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. Pub Date: 2022-10-10. DOI: 10.1145/3551349.3560418

Liu Wang, Haoyu Wang, Xiapu Luo, Yulei Sui

Citations: 3

Abstract

Machine-learning-based Android malware detection has attracted a great deal of research in recent years. A reliable malware dataset is critical for evaluating the effectiveness of malware detection approaches. Unfortunately, the malware datasets used in our community are mainly labelled by leveraging existing anti-virus services such as VirusTotal, which are prone to mislabelling. This can lead to inaccurate evaluation of malware detection techniques. Removing label noise from Android malware datasets can be quite challenging, especially at large scale. To address this problem, we propose MalWhiteout, an approach for reducing label errors in Android malware datasets. Specifically, we introduce Confident Learning (CL), an advanced noise estimation approach, to the domain of Android malware detection. To combat the false positives introduced by CL, we incorporate ensemble learning and inter-app relations to make noise detection more robust. We evaluate MalWhiteout on a curated, large-scale, and reliable benchmark dataset. Experimental results show that MalWhiteout detects label noise with over 94% accuracy even when the dataset has a high noise ratio (30%). MalWhiteout outperforms the state-of-the-art approach in both effectiveness (8% to 218% improvement) and efficiency (70 to 249 times faster) across different settings. We also show that reducing label noise improves the performance of existing malware detection approaches.
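
The abstract names two ingredients, Confident Learning and ensemble learning, without showing how they fit together. The following is a minimal Python sketch of the general idea, not the MalWhiteout implementation: it applies the standard per-class-threshold rule from Confident Learning to each model's predicted probabilities and then keeps only the samples that a majority of models flag. The function names, the majority-vote rule, and the toy probabilities are assumptions made for illustration; the inter-app relation step is omitted, and the probabilities are assumed to be out-of-sample (for example, cross-validated).

```python
import numpy as np


def confident_learning_flags(labels, pred_probs):
    """Flag likely label errors with the per-class threshold rule.

    labels: (n,) integer labels, e.g. 0 = benign, 1 = malware.
    pred_probs: (n, k) predicted probabilities, assumed out-of-sample.
    Assumes every class has at least one labelled sample.
    """
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]
    # Threshold for class j: mean probability of j over samples labelled j.
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )
    above = pred_probs >= thresholds
    # Among classes that clear their threshold, pick the most probable one.
    confident_class = np.where(above, pred_probs, -np.inf).argmax(axis=1)
    # A sample is suspicious if it is confidently assigned to another class.
    return above.any(axis=1) & (confident_class != labels)


def ensemble_flags(labels, pred_probs_per_model, min_votes=None):
    """Keep only samples flagged by at least min_votes models (hypothetical rule)."""
    votes = np.sum(
        [confident_learning_flags(labels, p) for p in pred_probs_per_model],
        axis=0,
    )
    if min_votes is None:
        min_votes = len(pred_probs_per_model) // 2 + 1  # simple majority
    return votes >= min_votes


# Toy data: six apps, the third is labelled benign but looks like malware.
labels = np.array([0, 0, 0, 1, 1, 1])
probs_a = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
                    [0.15, 0.85], [0.1, 0.9], [0.3, 0.7]])
probs_b = np.array([[0.7, 0.3], [0.15, 0.85], [0.1, 0.9],
                    [0.3, 0.7], [0.2, 0.8], [0.1, 0.9]])
probs_c = np.array([[0.9, 0.1], [0.85, 0.15], [0.3, 0.7],
                    [0.1, 0.9], [0.2, 0.8], [0.4, 0.6]])

single = confident_learning_flags(labels, probs_b)
voted = ensemble_flags(labels, [probs_a, probs_b, probs_c])
print("model B alone:", single.tolist())
print("majority vote:", voted.tolist())
```

In this toy run, model B on its own flags both the second and third apps, while the vote keeps only the third, which illustrates how combining models can suppress a single model's false positives.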

Source journal

Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering

Self-citation rate

0.00%

Number of publications