A fine-grained message clustering method based on message representation and identifier fingerprints

IF 5.4 | Zone 2 (Computer Science) | JCR Q1: COMPUTER SCIENCE, INFORMATION SYSTEMS

Computers & Security | Pub Date: 2025-08-20 | DOI: 10.1016/j.cose.2025.104631

Degang Li, Xi Chen, Mingliang Zhu, Qingjun Yuan, Chunxiang Gu

Computers & Security, Volume 158, Article 104631. Journal Article.
https://www.sciencedirect.com/science/article/pii/S0167404825003207

Citations: 0

Abstract

Protocol reverse engineering is a critical technique for analyzing private and unknown protocols. Message clustering is a foundational step in protocol reverse engineering and plays a key role in traffic classification and format inference. In this paper, we propose a fine-grained clustering method for unknown messages, termed FG-MCRF. FG-MCRF first builds a representation network with low information loss, extracts deep representation vectors from the raw message data, and forms high-purity message clusters from those vectors. It then constructs a high-precision global message fingerprint for each cluster from message length identifiers, operation identifiers, and counter identifiers. Finally, it builds a message relationship graph from these global fingerprints and uses the graph to determine the final message types. We also introduce the fine-grained multi-protocol dataset (FgMPD) to evaluate clustering performance. On FgMPD, FG-MCRF outperforms the baseline methods: its clustering purity, Adjusted Rand Index (ARI), completeness, and accuracy on the fine-grained message clustering task are 0.9961, 0.9897, 0.9837, and 0.9899, respectively, improvements of 3.2%, 10.5%, 10.9%, and 8.7% over state-of-the-art (SOTA) baselines. These results indicate that FG-MCRF generalizes well and is extensible, which facilitates fine-grained message clustering.
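To make two of the reported metrics concrete, the following is a minimal sketch of how clustering purity and ARI are commonly computed from ground-truth message types and predicted cluster ids. It is not taken from the paper: the function names `purity` and `adjusted_rand_index` and the toy labels are assumptions chosen for illustration, and the authors' exact evaluation code may differ. Completeness and accuracy are omitted.

```python
from collections import Counter
from math import comb


def purity(true_labels, cluster_ids):
    """Fraction of messages that carry the majority true type of their cluster."""
    by_cluster = {}
    for true_type, cluster in zip(true_labels, cluster_ids):
        by_cluster.setdefault(cluster, Counter())[true_type] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


def adjusted_rand_index(true_labels, cluster_ids):
    """Pair-counting agreement between two partitions, corrected for chance."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of messages that share both a true type and a cluster.
    sum_cells = sum(comb(v, 2) for v in Counter(zip(true_labels, cluster_ids)).values())
    # Pairs that share a true type, and pairs that share a cluster.
    sum_true = sum(comb(v, 2) for v in Counter(true_labels).values())
    sum_pred = sum(comb(v, 2) for v in Counter(cluster_ids).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every message is its own cluster in both partitions
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Hypothetical toy data: six messages, three true types, three predicted clusters.
    true_types = ["req", "req", "req", "resp", "resp", "ack"]
    clusters = [0, 0, 1, 1, 2, 2]
    print("purity:", round(purity(true_types, clusters), 4))
    print("ARI:", round(adjusted_rand_index(true_types, clusters), 4))
```

Purity rewards clusters that contain a single message type but does not penalize splitting one type across many clusters, which is why it is usually reported alongside a pair-based measure such as ARI.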




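Source journal

Computers & Security (Engineering & Technology, Computer Science: Information Systems)

CiteScore: 12.40
Self-citation rate: 7.10%
Articles published: 365
Review time: 10.7 months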

About the journal: Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world. Computers & Security provides you with a unique blend of leading-edge research and sound practical management advice. It is aimed at the professional involved with computer security, audit, control and data integrity in all sectors: industry, commerce and academia. Recognized worldwide as THE primary source of reference for applied research and technical expertise, it is your first step to fully secure systems.