A Method of Constructing Malware Classification Dataset Using Clustering

Woo-Jin Joe, Hyong-Shik Kim
Published in: 2022 IEEE 4th International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications (TPS-ISA), 2022-12-01
DOI: https://doi.org/10.1109/TPS-ISA56441.2022.00025

Abstract

Machine learning, which automatically learns models from data, is receiving considerable attention as a way to cope with the number of malicious programs, which grows every year. However, because most malware samples are variants built by reusing existing malicious code, models overfit to the training set more easily than in other domains. Previous studies have tried to remove variants using labels provided by antivirus ("vaccine") software, but because these labels are inaccurate, this can lead to indiscriminate removal of samples. We therefore propose constructing a dataset by clustering the samples and randomly selecting one sample from each cluster. To demonstrate that the proposed construction prevents overfitting and improves generalization performance, we experimented with three training sets: one from which variants are not removed, one from which duplicated families are removed using labels, and one from which duplicated families are removed by the proposed method. To measure generalization performance, we used six test sets constructed according to their similarity to the training sets. Models trained on the training set constructed by the proposed method performed better than the other models on four of the six test sets.
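
The core idea, clustering the samples and keeping one randomly chosen sample per cluster, can be illustrated with a minimal sketch. The abstract does not state which features, similarity measure, or clustering algorithm the authors use, so everything below is an assumption made for illustration: samples are represented as hypothetical sets of API names, similarity is Jaccard, and clusters are formed by single-linkage grouping above a threshold. The function names and the threshold value are likewise hypothetical.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_samples(features, threshold):
    """Group samples whose pairwise similarity reaches `threshold`.

    Single-linkage grouping via union-find; this is a stand-in for
    whatever clustering algorithm the paper actually uses.
    """
    parent = {name: name for name in features}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(features, 2):
        if jaccard(features[a], features[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in features:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


def build_training_set(features, threshold=0.5, seed=0):
    """Keep one randomly selected sample from each cluster."""
    rng = random.Random(seed)
    return [rng.choice(members) for members in cluster_samples(features, threshold)]


# Hypothetical toy data: a1/a2 and b1/b2 are near-duplicate variants.
samples = {
    "a1": {"CreateFile", "WriteFile", "RegSetValue"},
    "a2": {"CreateFile", "WriteFile", "RegSetValue", "Sleep"},
    "b1": {"connect", "send", "recv"},
    "b2": {"connect", "send", "recv", "Sleep"},
    "c1": {"VirtualAlloc", "CreateRemoteThread"},
}
clusters = cluster_samples(samples, threshold=0.5)
training_set = build_training_set(samples, threshold=0.5, seed=0)
```

With this toy data the five samples fall into three clusters, so the resulting training set holds three samples, one per group of variants. Unlike label-based deduplication, the grouping here depends only on measured similarity between samples, which is the property the abstract contrasts with inaccurate antivirus labels.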