{"title":"Dealing with Class Noise in Large Training Datasets for Malware Detection","authors":"Dragos Gavrilut, Liviu Ciortuz","doi":"10.1109/SYNASC.2011.39","journal":{"name":"2011 13th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing"},"publicationDate":"2011-09-26","platform":"Semanticscholar","ListUrlMain":"https://doi.org/10.1109/SYNASC.2011.39"}
Dealing with Class Noise in Large Training Datasets for Malware Detection
This paper presents the approaches we have explored so far for detecting and dealing with the class noise found in the large annotated datasets used to train the classifiers that we previously designed for industrial-scale malware identification. First, we established a number of distance-based filtering rules that allow us to identify different "levels" of potential noise in the training data. Second, we analysed how either removing or "cleaning" the potentially noisy records affects the performance of our simplest classifiers. We show that careful distance-based filtering can lead to noticeably better results in malware detection.
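
The abstract does not spell out the filtering rules themselves, so the following is only a minimal sketch of what distance-based class-noise handling could look like, not the authors' method. It assumes binary feature vectors, Hamming distance, and a hypothetical rule: a record is suspect when a record with the opposite label lies within `max_dist` of it. Raising `max_dist` yields a broader "level" of potential noise. The helper names (`flag_suspects`, `remove_suspects`, `relabel_suspects`) and the toy data are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

# (binary feature vector, label), where 1 = malware and 0 = clean.
Record = Tuple[Sequence[int], int]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def flag_suspects(records: List[Record], max_dist: int) -> List[int]:
    """Return indices of records lying within max_dist of an opposite-label record.

    Assumed rule for illustration; quadratic in the number of records.
    """
    suspects = []
    for i, (xi, yi) in enumerate(records):
        for j, (xj, yj) in enumerate(records):
            if i != j and yi != yj and hamming(xi, xj) <= max_dist:
                suspects.append(i)
                break
    return suspects


def remove_suspects(records: List[Record], suspects: List[int]) -> List[Record]:
    """Option 1: drop every suspect record from the training set."""
    drop = set(suspects)
    return [r for idx, r in enumerate(records) if idx not in drop]


def relabel_suspects(records: List[Record], suspects: List[int], k: int = 3) -> List[Record]:
    """Option 2: "clean" each suspect by majority vote of its k nearest non-suspect records.

    On a tied vote (or when no trusted record exists) the original label is kept.
    """
    suspect_set = set(suspects)
    trusted = [r for idx, r in enumerate(records) if idx not in suspect_set]
    cleaned = list(records)
    for i in suspects:
        xi, _ = records[i]
        nearest = sorted(trusted, key=lambda r: hamming(xi, r[0]))[:k]
        votes = sum(label for _, label in nearest)
        if nearest and votes * 2 != len(nearest):
            cleaned[i] = (xi, 1 if votes * 2 > len(nearest) else 0)
    return cleaned


# Toy data (made up): records 0 and 3 share features but carry opposite labels.
records: List[Record] = [
    ([1, 1, 1, 1], 1),
    ([1, 1, 1, 0], 1),
    ([1, 1, 0, 1], 1),
    ([1, 1, 1, 1], 0),
    ([0, 0, 0, 0], 0),
    ([0, 0, 0, 1], 0),
]

strict_level = flag_suspects(records, max_dist=0)  # only exact feature collisions
loose_level = flag_suspects(records, max_dist=1)   # also near-collisions
without_noise = remove_suspects(records, strict_level)
with_cleaned_labels = relabel_suspects(records, strict_level)
```

The pairwise loop is only practical for small samples; at the dataset sizes the abstract mentions, an indexed or hashed nearest-neighbour search would be needed, which this sketch does not attempt.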