A Method of Constructing Malware Classification Dataset Using Clustering
Woo-Jin Joe, Hyong-Shik Kim
2022 IEEE 4th International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications (TPS-ISA), 2022-12-01
DOI: 10.1109/TPS-ISA56441.2022.00025 (https://doi.org/10.1109/TPS-ISA56441.2022.00025)
Citations: 0
Abstract
Machine learning, which learns models automatically from data, is attracting considerable attention as a way to cope with the number of malware samples, which grows every year. However, because most malware consists of variants developed by reusing existing malicious code, models overfit the training set more easily than in other domains. Previous studies have tried to remove variants using the labels assigned by antivirus products, but because those labels are inaccurate, this can lead to indiscriminate removal of samples. We therefore propose constructing a dataset by clustering the samples and randomly selecting one sample from each cluster. To demonstrate that a training set constructed this way can prevent overfitting and improve generalization performance, we experimented with three training sets: one from which variants are not removed, one from which duplicated families are removed using labels, and one from which duplicated families are removed by the proposed method. To measure generalization performance, we evaluated on six test sets constructed according to their similarity to the training sets. Models trained on the training set constructed by the proposed method outperformed the other models on four of the six test sets.
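The core idea, clustering the samples and keeping one randomly chosen sample per cluster, can be illustrated with a minimal sketch. The abstract does not state which features, similarity measure, or clustering algorithm are used, so everything below is an assumption made for illustration: each sample is represented as a set of imported API names, similarity is Jaccard similarity, and clusters are formed by single-linkage grouping at a hypothetical threshold of 0.7. The sample identifiers and feature sets are invented toy data.

```python
import random
from collections import defaultdict


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two feature sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_samples(features: dict, threshold: float = 0.7) -> list:
    """Group samples whose pairwise similarity reaches the threshold.

    Assumption: single-linkage grouping via union-find stands in for
    whatever clustering algorithm the paper actually uses.
    """
    ids = sorted(features)
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(features[a], features[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    return sorted(clusters.values())


def pick_representatives(clusters: list, seed: int = 0) -> list:
    """Randomly select exactly one sample from each cluster."""
    rng = random.Random(seed)
    return [rng.choice(cluster) for cluster in clusters]


# Toy data (hypothetical): sample id -> set of imported API names.
samples = {
    "a1": frozenset({"CreateFileW", "WriteFile", "RegSetValueExW", "connect"}),
    "a2": frozenset({"CreateFileW", "WriteFile", "RegSetValueExW", "connect", "send"}),
    "b1": frozenset({"CryptEncrypt", "FindFirstFileW", "DeleteFileW"}),
    "b2": frozenset({"CryptEncrypt", "FindFirstFileW", "DeleteFileW", "MoveFileW"}),
    "c1": frozenset({"SetWindowsHookExW", "GetAsyncKeyState"}),
}

clusters = cluster_samples(samples, threshold=0.7)
training_set = pick_representatives(clusters, seed=0)
print(clusters)      # variants grouped together
print(training_set)  # one sample per cluster
```

In this toy data, `a1`/`a2` and `b1`/`b2` are near-duplicates and fall into the same clusters, so the resulting training set holds three samples instead of five. Unlike label-based deduplication, the grouping depends only on measured similarity, so an inaccurate antivirus label cannot cause unrelated samples to be discarded together.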