Robustness Matters: Pre-Training Can Enhance the Performance of Encrypted Traffic Analysis

Impact Factor: 8.0 · CAS Tier 1 (Computer Science) · JCR Q1, Computer Science, Theory & Methods
Luming Yang;Lin Liu;Jun-Jie Huang;Jiangyong Shi;Shaojing Fu;Yongjun Wang;Jinshu Su
DOI: 10.1109/TIFS.2025.3613970
IEEE Transactions on Information Forensics and Security, vol. 20, pp. 10588-10603
Published: 2025-09-24
Full text: https://ieeexplore.ieee.org/document/11177602/
Citations: 0

Abstract

Models with large-scale parameters and pre-training have been leveraged for encrypted traffic analysis. However, existing research has focused primarily on accuracy, often overlooking the role of large-scale pre-trained parameters in enhancing robustness. While machine learning (ML) and deep learning (DL) models trained from scratch can achieve high accuracy, they exhibit limited robustness: when subjected to real-world network noise, their identification results can fluctuate significantly, which is unacceptable. Unfortunately, current robustness evaluation methods neglect sample diversity and employ unrealistic noise settings, and the field still lacks a sound quantitative description of model robustness. In this paper, we propose the PA-curve to display the distribution of samples' correct-decision stability, which simultaneously reflects a model's accuracy and robustness. By calculating the area under the PA-curve, called the PA-area, we enable quantitative assessment of robustness for encrypted traffic analysis. Furthermore, we design a pre-trained model based on packet length sequences and pre-train it on TB-scale traffic; fine-tuned on limited labeled training data, it can perform downstream analysis tasks. We conduct experiments on five encrypted traffic datasets covering different tasks. Beyond accuracy, we analyze the robustness of the pre-trained model and existing methods under common network disturbances, including packet loss, retransmission, and reordering. Experimental results demonstrate that, compared with ML-based and DL-based models trained from scratch, the pre-trained model not only achieves high accuracy but also exhibits greater resilience to network noise. The source code is available at https://github.com/Shangshu-LAB/BERT-ps
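The abstract does not give the PA-curve's exact construction. One plausible reading is that the curve plots, for each stability threshold, the fraction of samples whose correct-decision rate under repeated noisy trials meets that threshold, with the PA-area as its integral. The following is a minimal sketch under that assumption; the names `pa_curve` and `pa_area` and the per-sample stability input are illustrative, not taken from the paper:

```python
import numpy as np

def pa_curve(stability, thresholds=None):
    """Fraction of samples whose correct-decision stability meets each threshold.

    stability[i] is the rate at which sample i is classified correctly
    across repeated noisy perturbations (a value in [0, 1]).
    """
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    stability = np.asarray(stability, dtype=float)
    frac = np.array([(stability >= t).mean() for t in thresholds])
    return thresholds, frac

def pa_area(stability):
    """Area under the PA-curve via the trapezoidal rule.

    1.0 means every sample is always classified correctly under noise;
    lower values indicate lost accuracy and/or unstable decisions.
    """
    t, frac = pa_curve(stability)
    return float(np.sum((frac[1:] + frac[:-1]) / 2.0 * np.diff(t)))

# Example: three samples always correct, one never correct,
# one correct in 60% of noisy trials.
rates = [1.0, 1.0, 1.0, 0.0, 0.6]
area = pa_area(rates)
```

Under this reading, a model that is accurate but unstable and a model that is stable but inaccurate both lose PA-area, which matches the abstract's claim that the metric reflects accuracy and robustness simultaneously.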
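The network disturbances named in the abstract (packet loss, retransmission, reordering) can be simulated directly on a packet-length sequence when probing a model's decision stability. The sketch below is illustrative and not the paper's exact noise model; `perturb` and its rate parameters are assumptions:

```python
import random

def perturb(lengths, loss=0.05, retrans=0.05, swap=0.05, seed=None):
    """Apply illustrative network noise to a packet-length sequence:
    drop packets (loss), duplicate packets (retransmission), and
    swap adjacent packets (reordering)."""
    rng = random.Random(seed)
    out = []
    for pkt in lengths:
        if rng.random() < loss:
            continue                      # packet lost in transit
        out.append(pkt)
        if rng.random() < retrans:
            out.append(pkt)               # retransmitted duplicate
    i = 0
    while i < len(out) - 1:
        if rng.random() < swap:           # out-of-order arrival
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out
```

Running a classifier on many such perturbed copies of each flow yields the per-sample correct-decision rates that a stability distribution like the PA-curve summarizes.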
Source journal: IEEE Transactions on Information Forensics and Security (Engineering: Electrical & Electronic)
CiteScore: 14.40
Self-citation rate: 7.40%
Annual articles: 234
Review time: 6.5 months
Journal description: The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance, and systems applications that incorporate these features.