MalGEA: A malware analysis framework via matrix factorization based node embedding and graph external attention
Ruisheng Li, Qilong Zhang, Huimin Shen
Array, volume 27, Article 100493. Published 2025-08-19.
DOI: 10.1016/j.array.2025.100493
URL: https://www.sciencedirect.com/science/article/pii/S2590005625001201
JCR: Q2 (Computer Science, Theory & Methods). Impact factor: 4.5.
Citation count: 0
Abstract
Malware is one of the major threats in cybersecurity, and its volume has grown continuously and steadily. In recent years, researchers have proposed a number of malware detection methods based on graph representation learning that exploit the intrinsic topological features of malware, leading to considerable progress in this area. However, existing studies still have two major limitations. (1) The complex topological structures of malware graphs often result in high computational overhead during feature extraction and processing. (2) Most existing approaches rely on conventional graph neural networks that are not specifically designed for malware classification, which leads to suboptimal performance, especially on minority-class samples. To address these problems, we propose MalGEA, a novel malware detection and classification framework based on matrix factorization and graph external attention mechanisms. First, MalGEA extracts function call information from malware and constructs the corresponding function call graphs. These graphs are then processed using sparse matrix factorization and spectral propagation to efficiently generate node embeddings. Finally, we employ a graph external attention network to model inter-graph relationships and perform malware detection and classification. To evaluate our approach, we used a benchmark malware dataset that contains 6 categories and 35 families, with 50k benign and 50k malicious samples. Experimental results demonstrate that our method significantly outperforms existing node embedding approaches in terms of computational efficiency, while also achieving high accuracy in malware detection and family classification tasks.
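The three stages named in the abstract (function call graph, factorization plus propagation for node embeddings, external attention over the embeddings) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no formulas, so the dense SVD stands in for sparse matrix factorization, the averaging step stands in for spectral propagation, and the attention follows the generic external-attention pattern (shared memory units with double normalization). All function names, the edge list, and the sizes below are hypothetical.

```python
import numpy as np

def build_adjacency(edges, n):
    """Symmetric 0/1 adjacency matrix from caller -> callee edges (toy input)."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = 1.0
        A[v, u] = 1.0
    return A

def factorize_embed(A, dim):
    """Assumed stand-in for sparse matrix factorization:
    truncated SVD of the row-normalized adjacency matrix."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # avoid division by zero for isolated nodes
    U, s, _ = np.linalg.svd(A / deg)
    return U[:, :dim] * np.sqrt(s[:dim])     # (n, dim) node embeddings

def propagate(A, E, steps=2):
    """Assumed simplification of spectral propagation:
    repeatedly average each embedding with its normalized neighborhood."""
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    A_norm = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    for _ in range(steps):
        E = 0.5 * (E + A_norm @ E)
    return E

def external_attention(F, M_k, M_v):
    """External attention: nodes attend to memory units shared across graphs.
    F: (n, d) node features; M_k, M_v: (S, d) memory keys and values."""
    scores = F @ M_k.T                                       # (n, S)
    scores = scores - scores.max(axis=0, keepdims=True)      # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=0, keepdims=True)            # softmax over nodes
    attn = attn / (attn.sum(axis=1, keepdims=True) + 1e-9)   # L1 norm over memory units
    return attn @ M_v                                        # (n, d)

# Hypothetical 5-function call graph.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
A = build_adjacency(edges, 5)
E = propagate(A, factorize_embed(A, dim=2))

# Random memory units; in a trained model these would be learned parameters.
rng = np.random.default_rng(0)
M_k = rng.normal(size=(3, 2))
M_v = rng.normal(size=(3, 2))
out = external_attention(E, M_k, M_v)
graph_vector = out.mean(axis=0)   # pooled graph-level feature for a classifier
```

Because the memory units do not depend on any single graph, the same `M_k` and `M_v` would be reused for every sample, which is one way to read the abstract's "inter-graph relationships". A real pipeline would replace the dense SVD with a sparse solver to obtain the efficiency the abstract claims.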