Source identification for worm propagation: A graph neural network approach and evaluation on social network and internet datasets

IF 4.6 · Zone 2, Computer Science · JCR Q1, Computer Science, Hardware & Architecture

Computer Networks · Pub Date: 2025-08-16 · DOI: 10.1016/j.comnet.2025.111616

Qitao Huo, Peng Zhou

Computer Networks, Volume 271, Article 111616. https://www.sciencedirect.com/science/article/pii/S1389128625005833

Citations: 0

Abstract

Source identification plays an essential role in the analysis and forensics of worm propagation, but it is challenging because observed propagation graphs leave only limited traces and clues. State-of-the-art solutions are mostly based on unsupervised graph induction and reasoning, and therefore miss the chance to draw on additional sources of information for worm tracing. In this paper, we go beyond unsupervised source identification and make perhaps the first attempt to design a supervised solution that “borrows” outside information to facilitate the detection of propagation sources. Our basic idea is to apply a graph neural network (GNN) to learn additional clues (specifically, the node state distributions over the graph structure) from a training set of propagation graph samples whose sources are known in advance, and thereby to model the mapping between node state distributions and the nodes that act as propagation sources. In this way, we convert the unsupervised source identification problem into a supervised classification of propagation graphs with the sources as class labels, so that a given worm can later be traced back using similar propagation behaviors found in the sampled propagation graphs. Applying a GNN model directly is not very effective on large graphs, since every node must be treated as an individual class label, so we propose a hierarchical improvement. That is, we cluster the nodes of the large graph into several smaller subgraphs (i.e., communities) and then deploy a set of GNN models over these subgraphs in a hierarchical architecture, which greatly reduces the number of class labels each GNN model must handle.
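The two ideas above, labelled training samples and a hierarchical label space, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a propagation model, so a toy susceptible-infected (SI) spread is assumed here. The function names (`simulate_si`, `make_samples`, `hierarchical_labels`), the toy graph, and the hand-written communities are all hypothetical. No GNN is trained; the sketch only shows what a (node states, source label) pair looks like and how clustering shrinks the number of classes each model must separate.

```python
import random

def simulate_si(adj, source, steps, beta, rng):
    """Toy susceptible-infected spread (an assumed model); returns a 0/1 state per node."""
    infected = {source}
    for _ in range(steps):
        newly = set()
        for u in infected:
            for v in adj[u]:
                # Each infected node infects each susceptible neighbour with probability beta.
                if v not in infected and rng.random() < beta:
                    newly.add(v)
        infected |= newly
    return [1 if n in infected else 0 for n in sorted(adj)]

def make_samples(adj, n_per_source, steps=3, beta=0.5, seed=0):
    """Build (node_states, source_label) pairs, with one class label per node."""
    rng = random.Random(seed)
    return [(simulate_si(adj, s, steps, beta, rng), s)
            for s in sorted(adj) for _ in range(n_per_source)]

def hierarchical_labels(communities):
    """Map each node to (community_id, local_index) for two-level classification."""
    return {node: (cid, idx)
            for cid, members in enumerate(communities)
            for idx, node in enumerate(members)}

# Hypothetical toy graph: two triangles joined by the edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
# Assumed to come from any community detection step; written by hand here.
communities = [[0, 1, 2], [3, 4, 5]]

samples = make_samples(adj, n_per_source=2)
labels = hierarchical_labels(communities)

flat_classes = len(adj)  # flat formulation: one class per node
largest_hier = max(len(communities), max(len(c) for c in communities))
print(flat_classes, largest_hier)
```

On this toy graph, a flat classifier has to separate 6 classes. In the two-level arrangement, one model picks among 2 communities and each per-community model picks among 3 nodes, so no single model faces more than 3 classes.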
To evaluate the effectiveness of our solution, we ran extensive source identification experiments on worm propagation graphs simulated from synthetic, social network, and Internet datasets. The results confirm that our supervised solution achieves higher identification accuracy (measured by the length of the shortest path from the identified source to the true one) than competing approaches. In the best case, identification accuracy improves by up to a factor of ten.
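The accuracy measure mentioned above, the shortest-path length from the identified source to the true one, can be computed with a breadth-first search on an unweighted graph. The sketch below assumes an adjacency-list graph and hop counts as the distance. The helper name `hop_distance`, the toy path graph, and the example predictions are hypothetical and are not taken from the paper's experiments.

```python
from collections import deque

def hop_distance(adj, start, goal):
    """Shortest-path length in hops between two nodes, or None if they are disconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj[node]:
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Hypothetical path graph 0-1-2-3 plus an isolated node 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}

# Made-up (predicted_source, true_source) pairs, for illustration only.
predictions = [(0, 3), (1, 1), (2, 3)]
errors = [hop_distance(path, pred, true) for pred, true in predictions]
mean_error = sum(errors) / len(errors)
```

A smaller mean hop distance means the identified sources lie closer to the true ones. An exact hit contributes a distance of 0.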




Source journal

Computer Networks (Engineering & Technology: Telecommunications)

CiteScore: 10.80
Self-citation rate: 3.60%
Articles published: 434
Review time: 8.6 months

Journal description: Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.