CNN-AutoMIC: Combining convolutional neural network and autoencoder to learn non-linear features for KNN-based malware image classification
Simone Andriani, Stefano Galantucci, Andrea Iannacone, Antonio Maci, Giuseppe Pirlo
Computers & Security, volume 156, Article 104507. Published 2025-05-19. DOI: 10.1016/j.cose.2025.104507
Impact factor: 4.8. JCR: Q1 (Computer Science, Information Systems).
https://www.sciencedirect.com/science/article/pii/S0167404825001968
Citations: 0
Abstract
Malware is malicious software, or a software component, intended for malicious purposes. Manual analysis and detection of malicious software is challenging because of its complexity. Thus, several automated solutions have become popular for real-time malware detection. A widespread approach consists of generating images from the samples' bytecode and feeding them to convolutional neural networks (CNNs), which are used either as classifiers or as feature extractors for further classification algorithms. These systems perform extremely well when trained and tested on partitions of the same dataset. However, real-world deployment requires cross-dataset tests and verification of malware detection on emerging real-world samples. This is a crucial challenge when probing the robustness of such systems and models. This paper proposes CNN-AutoMIC, a robust automated approach to extracting features from malware images. CNN-AutoMIC employs a specific CNN architecture to extract features, followed by an autoencoder-based compressor that reduces the features to two fundamental components. The two-dimensional projection of these components is the basis of the predictions made by the K-nearest neighbors (K-NN) algorithm. Moreover, the observable placement of new samples on the resulting scatter plot makes it possible to explain why the AI-based system produced a certain prediction. CNN-AutoMIC was benchmarked against several CNN-based models and a Vision Transformer. All models were trained on the Malevis dataset and evaluated cross-dataset on four different real-world datasets. CNN-AutoMIC outperformed the competitors on every classification performance metric while requiring reasonable training and prediction time. In addition, it achieves a promising Akaike information criterion (AIC) score, indicating its efficiency in terms of model complexity.
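The first and last stages of the pipeline described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the abstract does not specify the image width, the CNN architecture, the autoencoder, the value of K, or how the AIC is computed, so the function names (`bytes_to_image`, `knn_predict`, `aic`), the zero-padding of the last image row, the Euclidean distance, and `k=3` are all assumptions made for illustration. The CNN and autoencoder stages are omitted; the 2-D points below stand in for their output.

```python
import math
from collections import Counter


def bytes_to_image(data: bytes, width: int) -> list:
    """Reshape raw bytes into rows of grayscale pixels (0-255).

    Assumption: the last row is zero-padded to the chosen width.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padded = data + bytes(-len(data) % width)  # pad up to a multiple of width
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def knn_predict(points, labels, query, k=3):
    """Majority vote among the k nearest 2-D training points.

    Assumption: Euclidean distance on the two compressed components.
    """
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def aic(num_params: int, log_likelihood: float) -> float:
    """Standard Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * num_params - 2 * log_likelihood


# Hypothetical 2-D embeddings, standing in for the autoencoder's output.
embedding = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels = ["benign", "benign", "malware", "malware"]
prediction = knn_predict(embedding, labels, (0.85, 0.85), k=3)
```

Because the classifier operates on only two components, the neighbors that decided a prediction can be plotted directly, which is the kind of scatter-plot explanation the abstract refers to.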
About the journal:
Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world.
Computers & Security provides you with a unique blend of leading-edge research and sound practical management advice. It is aimed at the professional involved with computer security, audit, control and data integrity in all sectors: industry, commerce and academia. Recognized worldwide as THE primary source of reference for applied research and technical expertise, it is your first step to fully secure systems.