针对恶意软件检测的大型训练数据集类噪声处理

Dragos Gavrilut, Liviu Ciortuz
{"title":"针对恶意软件检测的大型训练数据集类噪声处理","authors":"Dragos Gavrilut, Liviu Ciortuz","doi":"10.1109/SYNASC.2011.39","DOIUrl":null,"url":null,"abstract":"This paper presents the ways we explored until now for detecting and dealing with the class noise found in large annotated datasets used for training the classifiers that we have previously designed for industrial-scale malware identification. First we established a number of distance-based filtering rules that allow us to identify different \"levels'' of potential noise in the training data, and secondly we analysed the effects produced by either removal or \"cleaning'' of the potentially-noised records on the performances of our simplest classifiers. We show that a careful distance-based filtering can lead to sensibly better results in malware detection.","PeriodicalId":184344,"journal":{"name":"2011 13th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Dealing with Class Noise in Large Training Datasets for Malware Detection\",\"authors\":\"Dragos Gavrilut, Liviu Ciortuz\",\"doi\":\"10.1109/SYNASC.2011.39\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents the ways we explored until now for detecting and dealing with the class noise found in large annotated datasets used for training the classifiers that we have previously designed for industrial-scale malware identification. First we established a number of distance-based filtering rules that allow us to identify different \\\"levels'' of potential noise in the training data, and secondly we analysed the effects produced by either removal or \\\"cleaning'' of the potentially-noised records on the performances of our simplest classifiers. We show that a careful distance-based filtering can lead to sensibly better results in malware detection.\",\"PeriodicalId\":184344,\"journal\":{\"name\":\"2011 13th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing\",\"volume\":\"9 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-09-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 13th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SYNASC.2011.39\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 13th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SYNASC.2011.39","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

摘要

本文介绍了我们迄今为止探索的方法,用于检测和处理在大型注释数据集中发现的类噪声,这些数据集用于训练我们之前为工业规模恶意软件识别设计的分类器。首先,我们建立了一些基于距离的过滤规则,使我们能够识别训练数据中潜在噪声的不同“级别”,其次,我们分析了删除或“清理”潜在噪声记录对我们最简单分类器性能的影响。我们表明,仔细的基于距离的过滤可以导致更好的恶意软件检测结果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Dealing with Class Noise in Large Training Datasets for Malware Detection
This paper presents the ways we explored until now for detecting and dealing with the class noise found in large annotated datasets used for training the classifiers that we have previously designed for industrial-scale malware identification. First we established a number of distance-based filtering rules that allow us to identify different "levels'' of potential noise in the training data, and secondly we analysed the effects produced by either removal or "cleaning'' of the potentially-noised records on the performances of our simplest classifiers. We show that a careful distance-based filtering can lead to sensibly better results in malware detection.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信