Improvement of Malicious Software Detection Accuracy through Genetic Programming Symbolic Classifier with Application of Dataset Oversampling Techniques

IF 2.6 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Computers. Pub Date: 2023-11-21. DOI: 10.3390/computers12120242

N. Anđelić, Sandi Baressi Šegota, Z. Car


Abstract

Malware detection using hybrid features, combining binary and hexadecimal analysis with DLL calls, is crucial for leveraging the strengths of both static and dynamic analysis methods. Artificial intelligence (AI) enhances this process by enabling automated pattern recognition, anomaly detection, and continuous learning, allowing security systems to adapt to evolving threats and identify complex, polymorphic malware that may exhibit varied behaviors. This synergy of hybrid features with AI enables malware detection systems to identify and respond to sophisticated cyber threats efficiently, proactively, and in real time. In this paper, the genetic programming symbolic classifier (GPSC) algorithm was applied to a publicly available dataset to obtain symbolic expressions (SEs) that detect malware with high classification performance. The initial problem with the dataset was a high imbalance between class samples, so various oversampling techniques were used to obtain balanced dataset variations on which GPSC was applied. To find the optimal combination of GPSC hyperparameter values, a random hyperparameter value search (RHVS) method was developed and applied to obtain SEs with high classification accuracy. The GPSC was trained with five-fold cross-validation (5FCV) to obtain a robust set of SEs on each dataset variation. To choose the best SEs, several evaluation metrics were used: the length and depth of the SEs, accuracy score (ACC), area under the receiver operating characteristic curve (AUC), precision, recall, F1-score, and the confusion matrix. The best SEs were then applied to the original imbalanced dataset to check whether the classification performance matched that obtained on the balanced dataset variations. The results showed that the proposed method generated SEs with high classification accuracy (0.9962) in malware detection.
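The pipeline outlined in the abstract (balance the classes, draw random hyperparameter values, score the resulting classifier) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: it uses plain random oversampling as a stand-in for the unspecified oversampling techniques, and the function names, the hyperparameter names, and their ranges are all hypothetical. The GPSC training step itself is omitted.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many samples as the largest one (assumed technique)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def sample_hyperparameters(space, rng):
    """Draw one random configuration; `space` maps a name to (low, high).
    Integer bounds give an integer draw, otherwise a uniform float."""
    config = {}
    for name, (low, high) in space.items():
        if isinstance(low, int) and isinstance(high, int):
            config[name] = rng.randint(low, high)
        else:
            config[name] = rng.uniform(low, high)
    return config


def binary_metrics(y_true, y_pred):
    """Confusion matrix, accuracy, precision, recall and F1 for labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


# Toy usage: the data and the search space below are made up for illustration.
X = [[0.1], [0.4], [0.5], [0.7], [0.9]]
y = [1, 1, 1, 1, 0]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))

space = {"population_size": (100, 1000), "crossover_probability": (0.5, 0.95)}
print(sample_hyperparameters(space, random.Random(1)))
print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

Reporting precision, recall, and F1 alongside accuracy matters most when the best expressions are re-evaluated on the original imbalanced data, where a classifier that favors the majority class can still score a high accuracy.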


