A Method of Constructing Malware Classification Dataset Using Clustering

2022 IEEE 4th International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications (TPS-ISA)
Pub Date: 2022-12-01
DOI: 10.1109/TPS-ISA56441.2022.00025

Woo-Jin Joe, Hyong-Shik Kim



Abstract

Machine learning, which learns models automatically from data, is attracting considerable attention as a way to cope with the number of malware samples, which grows every year. However, because most malware samples are variants built by reusing existing malicious code, models overfit the training set more easily than in other domains. Previous studies have tried to remove variants using the labels provided by antivirus ("vaccine") software, but because these labels are inaccurate, doing so can remove samples indiscriminately. We therefore propose constructing a dataset by clustering the samples and randomly selecting one sample from each cluster. To show that a training set built this way prevents overfitting and improves generalization performance, we experimented with three training sets: one from which variants are not removed, one from which duplicated families are removed using labels, and one from which duplicated families are removed by the proposed method. To measure generalization performance, we used six test sets constructed according to their similarity to the training sets. Models trained on the training set built by the proposed method performed better than the other models on four of the six test sets.
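The core idea, clustering the samples and keeping one randomly chosen sample per cluster, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which features, similarity measure, or clustering algorithm the authors use, so the Jaccard similarity over feature sets, the single-linkage grouping, the `threshold` parameter, and all function names here are hypothetical stand-ins.

```python
import random
from typing import Dict, FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two feature sets (assumed measure, not from the paper)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_samples(
    features: Dict[str, FrozenSet[str]], threshold: float
) -> List[List[str]]:
    """Group samples whose similarity is >= threshold (single linkage, union-find)."""
    ids = sorted(features)
    parent = {sample_id: sample_id for sample_id in ids}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of samples that is similar enough.
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(features[a], features[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for sample_id in ids:
        clusters.setdefault(find(sample_id), []).append(sample_id)
    return sorted(clusters.values())


def pick_representatives(clusters: List[List[str]], seed: int = 0) -> List[str]:
    """Randomly keep exactly one sample from each cluster."""
    rng = random.Random(seed)
    return [rng.choice(cluster) for cluster in clusters]


# Hypothetical toy data: sample_a and sample_b are near-duplicate variants.
toy_features = {
    "sample_a": frozenset({"CreateFile", "WriteFile", "RegSetValue"}),
    "sample_b": frozenset({"CreateFile", "WriteFile", "RegSetValue", "Sleep"}),
    "sample_c": frozenset({"connect", "send"}),
}
toy_clusters = cluster_samples(toy_features, threshold=0.7)
training_ids = pick_representatives(toy_clusters, seed=42)
```

With this toy data the two variants fall into one cluster and contribute a single training sample, while the unrelated sample is kept as its own cluster. The pairwise comparison is quadratic in the number of samples, so a real dataset would need a more scalable clustering step.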


