From Grim Reality to Practical Solution: Malware Classification in Real-World Noise

2023 IEEE Symposium on Security and Privacy (SP), Pub Date: 2023-05-01, DOI: 10.1109/SP46215.2023.10179453

Xian Wu, Wenbo Guo, Jia Yan, Baris Coskun, Xinyu Xing


Abstract

Malware datasets inevitably contain incorrect labels because of a shortage of the expertise and experience needed for sample labeling. Previous research has demonstrated that training on a dataset with incorrectly labeled samples results in inaccurate model learning. To address this problem, researchers have proposed various noise learning methods to offset the impact of incorrectly labeled samples, and these methods have been highly successful in image recognition and text mining applications. In this work, we apply both representative and state-of-the-art noise learning methods to real-world malware classification tasks. Surprisingly, we observe that none of the existing methods can minimize the impact of incorrect labels. Through a carefully designed experiment, we discover that this inefficacy results mainly from extreme data imbalance and a high percentage of incorrectly labeled samples. We therefore propose a new noise learning method, which we name MORSE. Unlike existing methods, MORSE customizes and extends a state-of-the-art semi-supervised learning technique. It treats possibly incorrectly labeled data as unlabeled data and thus avoids their potential negative impact on model learning. MORSE also integrates a sample re-weighting method that balances how the training data are used during model learning and thus handles the data imbalance challenge. We evaluate MORSE on both synthesized and real-world datasets and show that it significantly outperforms existing noise learning methods and minimizes the impact of incorrectly labeled data.
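The two ideas named in the abstract, treating suspect labels as unlabeled and re-weighting samples to counter class imbalance, can be illustrated with a minimal sketch. This is not MORSE's actual procedure, which the abstract does not specify. The function name `split_and_reweight`, the `label_confidence` scores, the `threshold` parameter, and the inverse-class-frequency weighting are all assumptions chosen for illustration.

```python
from collections import Counter

UNLABELED = -1  # hypothetical marker for samples whose label is not trusted


def split_and_reweight(labels, label_confidence, threshold=0.5):
    """Mark low-confidence labels as unlabeled and weight the rest.

    labels: list of int class ids (possibly noisy).
    label_confidence: list of floats in [0, 1]; an assumed score for how much
        some model agrees with each given label.
    threshold: assumed cut-off below which a label is not trusted.
    Returns (kept_labels, weights).
    """
    if len(labels) != len(label_confidence):
        raise ValueError("labels and label_confidence must have the same length")

    # Step 1: keep a label only when its confidence reaches the threshold;
    # otherwise treat the sample as unlabeled.
    kept = [
        y if c >= threshold else UNLABELED
        for y, c in zip(labels, label_confidence)
    ]

    # Step 2: count how many trusted samples each class has.
    counts = Counter(y for y in kept if y != UNLABELED)
    n_trusted = sum(counts.values())
    n_classes = len(counts)

    # Step 3: inverse-class-frequency weights, so every class contributes
    # equally in total. Unlabeled samples get weight 0.0 here; a
    # semi-supervised learner would use them through a separate
    # unsupervised loss term instead.
    weights = [
        n_trusted / (n_classes * counts[y]) if y != UNLABELED else 0.0
        for y in kept
    ]
    return kept, weights


if __name__ == "__main__":
    kept, weights = split_and_reweight(
        labels=[0, 0, 0, 0, 1, 1],
        label_confidence=[0.9, 0.8, 0.95, 0.2, 0.9, 0.3],
    )
    print(kept)
    print(weights)
```

In this toy input, class 0 keeps three trusted samples and class 1 keeps one, so the single class 1 sample receives three times the weight of each class 0 sample. A real pipeline would derive the confidence scores from a trained model and feed the unlabeled samples to the semi-supervised learner.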

