Degang Li, Xi Chen, Mingliang Zhu, Qingjun Yuan, Chunxiang Gu
A fine-grained message clustering method based on message representation and identifier fingerprints
Computers & Security, Volume 158, Article 104631. Published 2025-08-20.
DOI: 10.1016/j.cose.2025.104631
https://www.sciencedirect.com/science/article/pii/S0167404825003207
Abstract
Protocol reverse engineering is a critical technique for analyzing private and unknown protocols. Message clustering is a foundational step of protocol reverse engineering and plays a key role in traffic classification and format inference. In this paper, we propose FG-MCRF, a fine-grained clustering method for unknown messages. FG-MCRF first builds a representation network with low information loss to extract deep representation vectors from raw message data, and then forms high-purity message clusters from those vectors. For each message cluster, it constructs a high-precision global message fingerprint from message length identifiers, operation identifiers, and counter identifiers. It then builds a message relationship graph from these global fingerprints and uses the graph to determine the final message types. We also introduce the fine-grained multi-protocol dataset (FgMPD) to evaluate the clustering performance of our method. On FgMPD, FG-MCRF outperforms the baseline methods: its clustering purity, Adjusted Rand Index (ARI), completeness, and accuracy on the fine-grained message clustering task are 0.9961, 0.9897, 0.9837, and 0.9899, respectively, improvements of 3.2%, 10.5%, 10.9%, and 8.7% over state-of-the-art (SOTA) baseline methods. These results indicate that FG-MCRF has robust generalization capacity and extensibility, facilitating fine-grained message clustering.
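The abstract reports purity, ARI, and completeness without defining them. The following is a minimal sketch of the standard textbook definitions of these three metrics, written with only the Python standard library. It is not the authors' evaluation code; the function names and the toy labels are assumptions made for illustration, and the paper's own "accuracy" definition is not reproduced because the abstract does not specify it.

```python
from collections import Counter
from math import comb, log


def contingency(true_labels, pred_labels):
    """Count messages for each (true message type, predicted cluster) pair."""
    return Counter(zip(true_labels, pred_labels))


def purity(true_labels, pred_labels):
    """Fraction of messages that belong to the majority type of their cluster."""
    best = {}
    for (_, cluster), n in contingency(true_labels, pred_labels).items():
        best[cluster] = max(best.get(cluster, 0), n)
    return sum(best.values()) / len(true_labels)


def adjusted_rand_index(true_labels, pred_labels):
    """Pair-counting agreement corrected for chance (needs >= 2 messages)."""
    table = contingency(true_labels, pred_labels)
    pairs_total = comb(len(true_labels), 2)
    pairs_cells = sum(comb(v, 2) for v in table.values())
    pairs_types = sum(comb(v, 2) for v in Counter(true_labels).values())
    pairs_clusters = sum(comb(v, 2) for v in Counter(pred_labels).values())
    expected = pairs_types * pairs_clusters / pairs_total
    max_index = (pairs_types + pairs_clusters) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster and type
        return 1.0
    return (pairs_cells - expected) / (max_index - expected)


def completeness(true_labels, pred_labels):
    """1 - H(cluster | type) / H(cluster): 1.0 when each type stays in one cluster."""
    n = len(true_labels)
    type_sizes = Counter(true_labels)
    cluster_sizes = Counter(pred_labels)
    h_cluster = -sum(s / n * log(s / n) for s in cluster_sizes.values())
    if h_cluster == 0:  # everything in one cluster is trivially complete
        return 1.0
    h_cluster_given_type = -sum(
        v / n * log(v / type_sizes[t])
        for (t, _), v in contingency(true_labels, pred_labels).items()
    )
    return 1 - h_cluster_given_type / h_cluster


if __name__ == "__main__":
    # Hypothetical toy data: six messages of two types split into three clusters.
    true_types = ["a", "a", "a", "b", "b", "b"]
    clusters = [0, 0, 1, 1, 2, 2]
    print("purity      ", purity(true_types, clusters))
    print("ARI         ", adjusted_rand_index(true_types, clusters))
    print("completeness", completeness(true_types, clusters))
```

Purity rewards clusters that contain a single message type but does not penalize splitting one type across many clusters, whereas completeness and ARI do, which is presumably why the abstract reports them together.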
About the journal:
Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world.
Computers & Security provides you with a unique blend of leading-edge research and sound practical management advice. It is aimed at the professional involved with computer security, audit, control and data integrity in all sectors: industry, commerce and academia. Recognized worldwide as the primary source of reference for applied research and technical expertise, it is your first step to fully secure systems.