FCG-MFD: Benchmark function call graph-based dataset for malware family detection

IF 7.7 · CAS Zone 2 (Computer Science) · JCR Q1, Computer Science, Hardware & Architecture

Journal of Network and Computer Applications Pub Date : 2024-11-07 DOI:10.1016/j.jnca.2024.104050

Hassan Jalil Hadi, Yue Cao, Sifan Li, Naveed Ahmad, Mohammed Ali Alshara

Journal of Network and Computer Applications, Volume 233, Article 104050. https://www.sciencedirect.com/science/article/pii/S1084804524002273

Citations: 0

Abstract

Cybercrime involving malware families is on the rise, and this growth persists despite the prevalence of antivirus software and of approaches for malware detection and classification. Security experts have adopted Machine Learning (ML) techniques to identify these cybercrimes. However, these approaches demand up-to-date malware datasets for continuous improvement as malware strains grow more sophisticated. We therefore present FCG-MFD, a benchmark dataset with extensive Function Call Graphs (FCGs) for malware family detection. The dataset is intended to help security systems withstand emerging malware families. It has two sub-datasets, FCG and Metadata (100,000 samples), collected from VirusSamples, VirusShare, VirusSign, theZoo, Vx-underground, and MalwareBazaar and curated using FCGs and metadata to improve the efficacy of ML algorithms. We propose a new malware analysis technique based on FCGs and graph embedding networks, which addresses the complexity of feature engineering in ML-based malware analysis. Semantic features are extracted with a method inspired by Natural Language Processing (NLP), in which functions are treated as sentences and instructions as words. We use a graph embedding network based on the node2vec mechanism to generate malware embedding vectors. These vectors combine structural and semantic features, enabling automated and efficient malware analysis. We evaluate FCG-MFD on the two sub-datasets (FCG and Metadata); the resulting F1-scores of 99.14% and 99.28% are competitive with state-of-the-art (SOTA) methods.
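To make the node2vec step concrete, the following is a minimal sketch of node2vec-style biased random walks over a toy function call graph. It is not the authors' implementation: the function names (`build_call_graph`, `node2vec_walk`, `generate_walks`), the example edges, and the parameter values are all hypothetical, and the graph is treated as undirected as a simplifying assumption. In a full pipeline the resulting walks would typically be fed to a skip-gram model to learn node embeddings; that step is omitted here.

```python
import random
from collections import defaultdict


def build_call_graph(edges):
    """Build an undirected adjacency map from (caller, callee) pairs."""
    graph = defaultdict(set)
    for caller, callee in edges:
        graph[caller].add(callee)
        graph[callee].add(caller)
    # Sorted neighbor lists keep the walks reproducible for a fixed seed.
    return {node: sorted(neighbors) for node, neighbors in graph.items()}


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Run one second-order biased walk in the style of node2vec.

    p is the return parameter and q the in-out parameter: an unnormalized
    weight of 1/p goes to stepping back to the previous node, 1 to nodes
    adjacent to the previous node, and 1/q to nodes further away.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = graph.get(current, [])
        if not neighbors:
            break
        if len(walk) == 1:
            # The first step has no previous node, so choose uniformly.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)
            elif candidate in graph[previous]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def generate_walks(graph, walks_per_node, length, p=1.0, q=1.0, seed=0):
    """Start walks_per_node walks from every node, in shuffled order."""
    rng = random.Random(seed)
    nodes = list(graph)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(node2vec_walk(graph, node, length, p, q, rng))
    return walks


# Hypothetical call edges for a single sample (caller, callee).
EXAMPLE_EDGES = [
    ("main", "parse_config"),
    ("main", "connect_c2"),
    ("parse_config", "read_file"),
    ("connect_c2", "encrypt"),
    ("connect_c2", "send"),
]

graph = build_call_graph(EXAMPLE_EDGES)
walks = generate_walks(graph, walks_per_node=2, length=5, p=1.0, q=0.5, seed=0)
for walk in walks[:3]:
    print(" -> ".join(walk))
```

Setting `q` below 1 biases walks outward, toward more distant parts of the call graph, while a small `p` keeps them close to the starting function. Which setting suits malware family detection is an empirical question that this sketch does not answer.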


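Source journal

Journal of Network and Computer Applications (Engineering & Technology; Computer Science: Interdisciplinary Applications)

CiteScore: 21.50

Self-citation rate: 3.40%

Articles published: 142

Review time: 37 days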

About the journal: The Journal of Network and Computer Applications welcomes research contributions, surveys, and notes in all areas relating to computer networks and their applications. Sample topics include new design techniques; interesting or novel applications, components, or standards; computer networks with tools such as the WWW; emerging standards for internet protocols; wireless networks; mobile computing; emerging computing models such as cloud and grid computing; and applications of networked systems for remote collaboration and telemedicine. The journal is abstracted and indexed in Scopus, Engineering Index, Web of Science, Science Citation Index Expanded, and INSPEC.