Source identification for worm propagation: A graph neural network approach and evaluation on social network and internet datasets
Qitao Huo, Peng Zhou
Computer Networks, Volume 271, Article 111616. Published 2025-08-16. DOI: 10.1016/j.comnet.2025.111616
Impact factor: 4.6; JCR: Q1 (Computer Science, Hardware & Architecture)
https://www.sciencedirect.com/science/article/pii/S1389128625005833
Citations: 0
Abstract
Source identification plays an essential role in the analysis and forensics of worm propagation, but it is difficult because observed propagation graphs retain only limited traces and clues. State-of-the-art solutions are mostly based on unsupervised graph induction and reasoning, and so miss the chance to draw on additional sources of information for worm tracing. In this paper, we go beyond unsupervised source identification and make perhaps the first attempt to design a supervised solution that “borrows” outside information to help detect the propagation sources. Our basic idea is to apply a graph neural network (GNN) to learn additional clues (specifically, the node state distributions over the graph structure) from a training set of propagation graph samples whose sources are known in advance, and thereby to model the mapping between node state distributions and the nodes that act as propagation sources. This converts the unsupervised source identification problem into a supervised classification of propagation graphs with the sources as class labels, so that a given worm can later be traced back using similar propagation behaviors found in the sampled propagation graphs. Applying the GNN model directly is not very effective on large graphs, because every candidate node must be treated as an individual class label, so we propose a hierarchical improvement. That is, we cluster the nodes of the large graph into several smaller subgraphs (i.e., communities) and then deploy a set of GNN models over these subgraphs in a hierarchical architecture, which greatly reduces the number of class labels each GNN model must handle.
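The data side of this idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the toy graph, the hand-picked communities, the discrete-time SI spreading model, and every function name below are hypothetical, and the GNN classifier itself is omitted. The sketch shows how simulated propagation graphs with known sources become labeled samples, and how a two-level (community, node-in-community) label scheme shrinks the number of classes any single model must separate.

```python
import random

# Toy graph (hypothetical): two triangles joined by the bridge 2-3.
ADJ = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
# Hand-picked communities; the paper clusters nodes, but the abstract
# does not name the clustering algorithm.
COMMUNITIES = [{0, 1, 2}, {3, 4, 5}]


def simulate_si(adj, source, steps, beta, rng):
    """Discrete-time SI spread (assumed model); returns the infected set."""
    infected = {source}
    for _ in range(steps):
        newly = {
            v
            for u in infected
            for v in adj[u]
            if v not in infected and rng.random() < beta
        }
        infected |= newly
    return infected


def make_samples(adj, per_source, steps=2, beta=0.5, seed=0):
    """Build (node_state_vector, source_label) pairs with known sources."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    samples = []
    for source in nodes:
        for _ in range(per_source):
            infected = simulate_si(adj, source, steps, beta, rng)
            states = [1 if v in infected else 0 for v in nodes]
            samples.append((states, source))
    return samples


def hierarchical_labels(communities):
    """Map each node to (community index, index inside that community)."""
    return {
        node: (c, i)
        for c, members in enumerate(communities)
        for i, node in enumerate(sorted(members))
    }


samples = make_samples(ADJ, per_source=3)
labels = hierarchical_labels(COMMUNITIES)
flat_classes = len(ADJ)                          # one class per node
top_classes = len(COMMUNITIES)                   # classes for the top-level model
leaf_classes = max(len(c) for c in COMMUNITIES)  # classes per community model
print(flat_classes, top_classes, leaf_classes)   # prints: 6 2 3
```

On this toy graph a flat classifier needs 6 classes, while the two-level scheme needs 2 at the top and at most 3 per community; the gap widens as the graph grows.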
To evaluate the effectiveness of our solution, we ran extensive source identification experiments on worm propagation graphs simulated from synthetic, social network, and Internet datasets. The results confirm that our supervised solution achieves higher identification accuracy (measured by the length of the shortest path from the identified source to the true one) than competing methods. In the best case, it improves identification accuracy by up to ten times.
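The accuracy measure mentioned here, the shortest-path length from the identified source to the true one, can be computed with a breadth-first search on an unweighted graph. The following is a minimal sketch, assuming hop count as the path length; the toy graph and the names `hop_distance` and `mean_error_distance` are hypothetical and not taken from the paper.

```python
from collections import deque

# Toy graph (hypothetical): two triangles joined by the bridge 2-3.
ADJ = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}


def hop_distance(adj, start, goal):
    """Shortest-path length in hops via BFS; None if goal is unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj[node]:
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def mean_error_distance(adj, pairs):
    """Average hop distance over (identified, true) source pairs.

    Unreachable pairs are skipped; returns None if no pair is reachable.
    """
    dists = [hop_distance(adj, pred, true) for pred, true in pairs]
    reachable = [d for d in dists if d is not None]
    return sum(reachable) / len(reachable) if reachable else None


# (identified source, true source) pairs, made up for illustration.
pairs = [(0, 5), (2, 3), (4, 4)]
print(mean_error_distance(ADJ, pairs))
```

A distance of 0 means the source was identified exactly, so lower values indicate better identification.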
About the journal:
Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.