Parallel Instance Filtering for Malware Detection

M. Jureček, Olha Jurecková
{"title":"Parallel Instance Filtering for Malware Detection","authors":"M. Jureček, Olha Jurecková","doi":"10.1109/SEAA56994.2022.00012","DOIUrl":null,"url":null,"abstract":"Machine learning algorithms are widely used in the area of malware detection. With the growth of sample amounts, training of classification algorithms becomes more and more expensive. In addition, training data sets may contain redundant or noisy instances. The problem to be solved is how to select representative instances from large training data sets without reducing the accuracy. This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF). The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process for each subset. Each subset consists of instances that have the same nearest enemy. As a result, the PIF algorithm is fast since subsets are processed independently of each other using parallel computation. We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples. The feature set was extracted using static analysis, and it includes metadata from the portable executable file format. Our experimental results demonstrate that the proposed instance selection algorithm reduces the size of a training data set significantly with the only slightly decreased accuracy. The PIF algorithm outperforms existing instance selection methods used in the experiments in terms of the ratio between average classification accuracy and storage percentage.","PeriodicalId":269970,"journal":{"name":"2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SEAA56994.2022.00012","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Machine learning algorithms are widely used in the area of malware detection. With the growth of sample amounts, training of classification algorithms becomes more and more expensive. In addition, training data sets may contain redundant or noisy instances. The problem to be solved is how to select representative instances from large training data sets without reducing the accuracy. This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF). The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process for each subset. Each subset consists of instances that have the same nearest enemy. As a result, the PIF algorithm is fast since subsets are processed independently of each other using parallel computation. We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples. The feature set was extracted using static analysis, and it includes metadata from the portable executable file format. Our experimental results demonstrate that the proposed instance selection algorithm reduces the size of a training data set significantly with the only slightly decreased accuracy. The PIF algorithm outperforms existing instance selection methods used in the experiments in terms of the ratio between average classification accuracy and storage percentage.
并行实例过滤恶意软件检测
机器学习算法被广泛应用于恶意软件检测领域。随着样本数量的增长,分类算法的训练成本越来越高。此外,训练数据集可能包含冗余或有噪声的实例。要解决的问题是如何在不降低准确率的情况下从大量的训练数据集中选择有代表性的实例。本文提出了一种新的并行实例选择算法,称为并行实例滤波(PIF)。该算法的主要思想是将数据集划分为覆盖整个数据集的实例的非重叠子集,并对每个子集应用过滤过程。每个子集由具有相同最近敌人的实例组成。因此,PIF算法是快速的,因为子集是通过并行计算相互独立处理的。我们将PIF算法与几种最先进的实例选择算法在500,000个恶意和良性样本的大型数据集上进行比较。特性集是使用静态分析提取的,它包括来自可移植可执行文件格式的元数据。我们的实验结果表明,所提出的实例选择算法显着减少了训练数据集的大小,仅略微降低了准确性。在平均分类准确率与存储百分比的比值方面,PIF算法优于实验中使用的现有实例选择方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信