Learning to detect PII: Tabular vs. Document classification models for network traffic analysis
Rishika Kohli, Shaifu Gupta, Manoj Singh Gaur, Soma S. Dhavala
Journal of Information Security and Applications, vol. 94, Article 104196. Published 2025-08-25. DOI: 10.1016/j.jisa.2025.104196
URL: https://www.sciencedirect.com/science/article/pii/S2214212625002339
Impact factor: 3.7; JCR: Q2 (Computer Science, Information Systems)
Citation count: 0
Abstract
Detecting Personally Identifiable Information (PII) exfiltration from mobile network traffic is critical for preserving user privacy. Traditional approaches rely on machine learning classifiers trained on manually engineered features extracted from network packets. Deep learning offers the potential to remove the reliance on such an external feature selection process; however, its effectiveness depends significantly on how the underlying packets are encoded. In this work, we investigate deep learning paradigms for PII detection with a focus on the impact of feature encoding strategies. We explore tabular modeling approaches, including both an existing architecture (FT-Transformer) and proposed modular frameworks that integrate a pretrained language model (all-MiniLM-L6-v2) for semantic feature embeddings, followed by a classifier. We also evaluate document classification modeling by applying pretrained language models such as TinyBERT directly to the raw packet content. We further demonstrate the feasibility of on-device inference by deploying trained models using ONNX and TensorFlow Lite. Finally, we recommend modeling strategies based on data size, performance, resource utilization, and generalizability, enabling model selection according to the primary requirement of the deployment scenario.
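The contrast between tabular and document encodings described above can be illustrated with a minimal sketch. It assumes the packet is a plain HTTP request; the sample request, the column naming scheme, and the helpers `to_tabular` and `to_document` are hypothetical and are not taken from the paper. In a tabular pipeline each field value would then be embedded separately before classification, while in a document pipeline the whole string would be passed to a text classifier.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical HTTP request used only for illustration (not from the paper's data).
RAW_PACKET = (
    "GET /track?uid=42&email=alice%40example.com HTTP/1.1\r\n"
    "Host: ads.example.com\r\n"
    "User-Agent: DemoApp/1.0\r\n"
    "\r\n"
)


def to_tabular(raw: str) -> dict[str, str]:
    """Tabular view: split the request into named fields, one column per field."""
    head, _, _body = raw.partition("\r\n\r\n")
    request_line, *header_lines = head.split("\r\n")
    method, target, _version = request_line.split(" ", 2)
    url = urlsplit(target)
    row = {"method": method, "path": url.path}
    for line in header_lines:
        name, _, value = line.partition(":")
        row["header." + name.strip().lower()] = value.strip()
    # parse_qsl percent-decodes values, so "%40" becomes "@".
    for key, value in parse_qsl(url.query):
        row["query." + key] = value
    return row


def to_document(raw: str) -> str:
    """Document view: keep the raw content as one whitespace-normalised string."""
    return " ".join(raw.split())


row = to_tabular(RAW_PACKET)
doc = to_document(RAW_PACKET)
```

Here `row` keeps the structure of the packet (for example, a separate `query.email` column), whereas `doc` leaves it to the language model to locate PII anywhere in the text. Which view works better is the question the paper studies; this sketch only shows the two input shapes.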
About the journal:
Journal of Information Security and Applications (JISA) focuses on original research and practice-driven applications with relevance to information security and applications. JISA provides a common linkage between a vibrant scientific and research community and industry professionals by offering a clear view of modern problems and challenges in information security, as well as identifying promising scientific and "best-practice" solutions. JISA issues offer a balance between original research work and innovative industrial approaches by internationally renowned information security experts and researchers.