Learning to detect PII: Tabular vs. Document classification models for network traffic analysis

IF 3.7, Zone 2 (Computer Science), Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of Information Security and Applications Pub Date : 2025-08-25 DOI:10.1016/j.jisa.2025.104196

Rishika Kohli , Shaifu Gupta , Manoj Singh Gaur , Soma S. Dhavala

Journal of Information Security and Applications, Volume 94, Article 104196. https://www.sciencedirect.com/science/article/pii/S2214212625002339

Citations: 0

Abstract

Detecting Personally Identifiable Information (PII) exfiltration from mobile network traffic is critical for preserving user privacy. Traditional approaches rely on machine learning classifiers trained on manually engineered features extracted from network packets. Deep learning offers the potential to remove the reliance on such an external feature selection process; however, its effectiveness depends significantly on how the underlying packets are encoded. In this work, we investigate deep learning paradigms for PII detection, with a focus on the impact of feature encoding strategies. We explore tabular modeling approaches, including both an existing architecture (FT-Transformer) and proposed modular frameworks that integrate a pretrained language model (all-MiniLM-L6-v2) for semantic feature embeddings, followed by a classifier. We also evaluate document classification modeling by applying pretrained language models such as TinyBERT directly to the raw packet content. We further demonstrate the feasibility of on-device inference by deploying trained models using ONNX and TensorFlow Lite. Finally, we recommend modeling strategies based on data size, performance, resource utilization, and generalizability, enabling model selection according to the primary requirement of the deployment scenario.
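The difference between the tabular and the document view of a packet can be sketched with a toy example. The snippet below is only an illustration under stated assumptions: the HTTP request, the column names, and the helpers `to_tabular` and `to_document` are hypothetical and are not the feature set or preprocessing used in the paper. It shows how the same packet could be turned either into named columns (the kind of input a tabular model consumes) or into one text sequence (the kind of input a document classifier consumes).

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical HTTP request; host, path, and values are made up for illustration.
RAW_PACKET = (
    "GET /track?email=alice%40example.com&os=android HTTP/1.1\r\n"
    "Host: ads.example.com\r\n"
    "User-Agent: DemoApp/1.0\r\n"
    "\r\n"
)


def to_tabular(raw: str) -> dict:
    """Tabular view: split the request into named columns, one per field."""
    head = raw.split("\r\n\r\n", 1)[0]
    request_line, *header_lines = head.split("\r\n")
    method, target, _version = request_line.split(" ", 2)
    url = urlsplit(target)
    row = {"method": method, "path": url.path}
    for line in header_lines:
        name, _, value = line.partition(":")
        row["header." + name.strip().lower()] = value.strip()
    # Query parameters are percent-decoded and become their own columns.
    for key, value in parse_qsl(url.query):
        row["query." + key] = value
    return row


def to_document(raw: str) -> str:
    """Document view: keep the packet as one text sequence, normalizing whitespace only."""
    return " ".join(raw.split())


row = to_tabular(RAW_PACKET)
doc = to_document(RAW_PACKET)

for column, value in row.items():
    print(f"{column:20s} {value}")
print(doc)
```

In the tabular view each column could be embedded separately before classification, whereas in the document view the whole string would be tokenized and passed to the language model in one pass; which of the two works better is the question the abstract sets out to study.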




Source journal: Journal of Information Security and Applications (Computer Science: Computer Networks and Communications)

CiteScore: 10.90

Self-citation rate: 5.40%

Articles published: 206

Review time: 56 days

About the journal: Journal of Information Security and Applications (JISA) focuses on original research and practice-driven applications with relevance to information security and applications. JISA provides a common linkage between a vibrant scientific and research community and industry professionals by offering a clear view on modern problems and challenges in information security, as well as identifying promising scientific and "best-practice" solutions. JISA issues offer a balance between original research work and innovative industrial approaches by internationally renowned information security experts and researchers.