Ensemble of Filter and Embedded Feature Selection Techniques for Malware Classification using High-dimensional Jar Extension Dataset
Authors: Yi Wei Tye, U. K. Yusof, Samat Tulpar
Published in: Proceedings of the 2023 12th International Conference on Software and Computer Applications, 2023-02-23
DOI: 10.1145/3587828.3587849 (https://doi.org/10.1145/3587828.3587849)
Citation count: 0
Abstract
Innovations in machine learning algorithms have enhanced the effectiveness of malware detection systems over the past decades. However, the advancement of high-throughput technologies results in high-dimensional malware data, making feature selection both useful and necessary for such datasets. Feature selection is an information retrieval tool that aims to improve classifiers by identifying the important features, which also helps reduce computational overhead. However, different feature selection algorithms select representative features using different criteria, making it difficult to determine the optimal technique for a dataset from a given domain. Ensemble feature selection approaches, which integrate the results of several feature selection methods, can be used to overcome the inadequacies of any single feature selection method. Therefore, this paper attempts to determine whether a heterogeneous ensemble of filter and embedded feature selection approaches can provide better classification performance on malware classification data than single feature selection techniques and other ensemble feature selection approaches. The proposed ensemble, named HEFS-ARLLEX, combines the ANOVA F-test, ReliefF, L1-penalized logistic regression, LASSO regression, Extra-Tree Classifier and XGBoost feature selection techniques. The experimental results show that HEFS-ARLLEX, which combines both filter and embedded methods, is a better choice, providing consistently high classification accuracy, recall, precision, specificity and F-measure, together with a reasonable feature reduction rate, for the malware classification dataset.
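The idea of integrating the results of several feature selectors can be illustrated with a minimal sketch. The abstract does not state how HEFS-ARLLEX combines its six selectors, so the mean-rank aggregation used here is an assumption, not the authors' method. The feature names, the importance scores, the helper functions `scores_to_ranks` and `ensemble_select`, and the definition of the reduction rate as one minus the fraction of features kept are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def scores_to_ranks(scores):
    """Turn a feature -> score dict into feature -> rank (1 = most important).

    Ties are broken alphabetically by feature name so the result is deterministic.
    """
    ordered = sorted(scores, key=lambda f: (-scores[f], f))
    return {feature: rank for rank, feature in enumerate(ordered, start=1)}


def ensemble_select(score_tables, k):
    """Assumed combiner: average each feature's rank across selectors, keep the best k."""
    rank_tables = [scores_to_ranks(table) for table in score_tables.values()]
    features = sorted(rank_tables[0])
    mean_rank = {f: mean(ranks[f] for ranks in rank_tables) for f in features}
    return sorted(features, key=lambda f: (mean_rank[f], f))[:k]


# Hypothetical importance scores from two filter methods and two embedded methods.
# In practice each table would come from fitting the corresponding selector.
score_tables = {
    "anova_f": {"f1": 12.0, "f2": 3.1, "f3": 0.4, "f4": 7.5},
    "relieff": {"f1": 0.30, "f2": 0.05, "f3": 0.02, "f4": 0.21},
    "l1_logreg": {"f1": 0.9, "f2": 0.0, "f3": 0.0, "f4": 1.4},
    "xgboost": {"f1": 0.41, "f2": 0.11, "f3": 0.02, "f4": 0.46},
}

selected = ensemble_select(score_tables, k=2)

# Assumed definition: fraction of the original features that were discarded.
reduction_rate = 1 - len(selected) / len(score_tables["anova_f"])

print(selected, reduction_rate)
```

Rank-based aggregation is used in this sketch because the raw scores of filter and embedded methods are on different scales (F-statistics, ReliefF weights, regression coefficients, tree importances), so they cannot be averaged directly. Other combiners, such as the union or intersection of each selector's top features, would fit the same structure.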