Improvement of Malicious Software Detection Accuracy through Genetic Programming Symbolic Classifier with Application of Dataset Oversampling Techniques

IF 2.6 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS
N. Anđelić, Sandi Baressi Baressi Šegota, Z. Car
{"title":"Improvement of Malicious Software Detection Accuracy through Genetic Programming Symbolic Classifier with Application of Dataset Oversampling Techniques","authors":"N. Anđelić, Sandi Baressi Baressi Šegota, Z. Car","doi":"10.3390/computers12120242","DOIUrl":null,"url":null,"abstract":"Malware detection using hybrid features, combining binary and hexadecimal analysis with DLL calls, is crucial for leveraging the strengths of both static and dynamic analysis methods. Artificial intelligence (AI) enhances this process by enabling automated pattern recognition, anomaly detection, and continuous learning, allowing security systems to adapt to evolving threats and identify complex, polymorphic malware that may exhibit varied behaviors. This synergy of hybrid features with AI empowers malware detection systems to efficiently and proactively identify and respond to sophisticated cyber threats in real time. In this paper, the genetic programming symbolic classifier (GPSC) algorithm was applied to the publicly available dataset to obtain symbolic expressions (SEs) that could detect the malware software with high classification performance. The initial problem with the dataset was a high imbalance between class samples, so various oversampling techniques were utilized to obtain balanced dataset variations on which GPSC was applied. To find the optimal combination of GPSC hyperparameter values, the random hyperparameter value search method (RHVS) was developed and applied to obtain SEs with high classification accuracy. The GPSC was trained with five-fold cross-validation (5FCV) to obtain a robust set of SEs on each dataset variation. To choose the best SEs, several evaluation metrics were used, i.e., the length and depth of SEs, accuracy score (ACC), area under receiver operating characteristic curve (AUC), precision, recall, f1-score, and confusion matrix. The best-obtained SEs are applied on the original imbalanced dataset to see if the classification performance is the same as it was on balanced dataset variations. The results of the investigation showed that the proposed method generated SEs with high classification accuracy (0.9962) in malware software detection.","PeriodicalId":46292,"journal":{"name":"Computers","volume":"46 10","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2023-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/computers12120242","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

Malware detection using hybrid features, combining binary and hexadecimal analysis with DLL calls, is crucial for leveraging the strengths of both static and dynamic analysis methods. Artificial intelligence (AI) enhances this process by enabling automated pattern recognition, anomaly detection, and continuous learning, allowing security systems to adapt to evolving threats and identify complex, polymorphic malware that may exhibit varied behaviors. This synergy of hybrid features with AI empowers malware detection systems to efficiently and proactively identify and respond to sophisticated cyber threats in real time. In this paper, the genetic programming symbolic classifier (GPSC) algorithm was applied to the publicly available dataset to obtain symbolic expressions (SEs) that could detect the malware software with high classification performance. The initial problem with the dataset was a high imbalance between class samples, so various oversampling techniques were utilized to obtain balanced dataset variations on which GPSC was applied. To find the optimal combination of GPSC hyperparameter values, the random hyperparameter value search method (RHVS) was developed and applied to obtain SEs with high classification accuracy. The GPSC was trained with five-fold cross-validation (5FCV) to obtain a robust set of SEs on each dataset variation. To choose the best SEs, several evaluation metrics were used, i.e., the length and depth of SEs, accuracy score (ACC), area under receiver operating characteristic curve (AUC), precision, recall, f1-score, and confusion matrix. The best-obtained SEs are applied on the original imbalanced dataset to see if the classification performance is the same as it was on balanced dataset variations. The results of the investigation showed that the proposed method generated SEs with high classification accuracy (0.9962) in malware software detection.
应用数据集超采样技术,通过遗传编程符号分类器提高恶意软件检测精度
使用混合功能(将二进制和十六进制分析与 DLL 调用相结合)进行恶意软件检测,对于发挥静态和动态分析方法的优势至关重要。人工智能(AI)通过自动模式识别、异常检测和持续学习增强了这一过程,使安全系统能够适应不断变化的威胁,并识别可能表现出各种行为的复杂多态恶意软件。这种混合功能与人工智能的协同作用使恶意软件检测系统能够高效、主动地实时识别和应对复杂的网络威胁。本文将遗传编程符号分类器(GPSC)算法应用于公开可用的数据集,以获得能够以高分类性能检测恶意软件的符号表达式(SE)。数据集最初的问题是类样本之间的高度不平衡,因此利用了各种超采样技术来获得平衡的数据集变化,并在此基础上应用 GPSC。为了找到 GPSC 超参数值的最佳组合,开发并应用了随机超参数值搜索法(RHVS),以获得分类准确率高的 SE。使用五倍交叉验证(5FCV)训练 GPSC,以在每个数据集变化上获得一组稳健的 SE。为了选择最佳 SE,使用了几个评估指标,即 SE 的长度和深度、准确度得分(ACC)、接收器工作特征曲线下面积(AUC)、精确度、召回率、f1-分数和混淆矩阵。将获得的最佳 SE 应用于原始不平衡数据集,以观察分类性能是否与平衡数据集变化时的性能相同。调查结果表明,在恶意软件检测方面,建议的方法生成的 SE 具有较高的分类准确率(0.9962)。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Computers
Computers COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS-
CiteScore
5.40
自引率
3.60%
发文量
153
审稿时长
11 weeks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信