Rishika Kohli, Shaifu Gupta, Manoj Singh Gaur, Soma S. Dhavala
Journal of Information Security and Applications, vol. 94, Article 104196. Journal Article, published 2025-08-25.
DOI: 10.1016/j.jisa.2025.104196
URL: https://www.sciencedirect.com/science/article/pii/S2214212625002339
Impact factor: 3.7. JCR: Q2 (Computer Science, Information Systems).
Learning to detect PII: Tabular vs. Document classification models for network traffic analysis
Detecting Personally Identifiable Information (PII) exfiltration from mobile network traffic is critical for preserving user privacy. Traditional approaches rely on machine learning classifiers trained on manually engineered features extracted from network packets. Deep learning offers the potential to remove the reliance on such an external feature selection process; however, its effectiveness depends significantly on how the underlying packets are encoded. In this work, we investigate deep learning paradigms for PII detection with a focus on the impact of feature encoding strategies. We explore tabular modeling approaches, including both an existing architecture (FT-Transformer) and proposed modular frameworks that integrate a pretrained language model (all-MiniLM-L6-v2) for semantic feature embeddings, followed by a classifier. We also evaluate document classification modeling by applying pretrained language models such as TinyBERT directly to the raw packet content. We further demonstrate the feasibility of on-device inference by deploying trained models using ONNX and TensorFlow Lite. Finally, we recommend modeling strategies based on data size, performance, resource utilization, and generalizability, enabling model selection according to the primary requirement of the deployment scenario.
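The difference between the tabular and document encodings can be illustrated with a minimal sketch. The abstract does not specify the paper's feature schema or preprocessing, so everything here is an assumption made for illustration: the sample HTTP request, the column naming (`query.*`, `header.*`), and the helper names `to_tabular` and `to_document` are all hypothetical. In the tabular view each request field becomes a named column whose value could then be embedded separately; in the document view the whole packet is passed as a single text string to a language model.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical example packet; not taken from the paper's dataset.
RAW_REQUEST = (
    "GET /track?email=alice%40example.com&lat=12.97 HTTP/1.1\r\n"
    "Host: ads.example.net\r\n"
    "User-Agent: DemoApp/1.0\r\n"
    "\r\n"
)


def to_tabular(raw: str) -> dict[str, str]:
    """Tabular view: one named column per request field (assumed schema)."""
    head, _, _body = raw.partition("\r\n\r\n")
    request_line, *header_lines = head.split("\r\n")
    method, target, _version = request_line.split(" ", 2)
    url = urlsplit(target)
    row = {"method": method, "path": url.path}
    # Each query parameter becomes its own column, with percent-decoding.
    for key, value in parse_qsl(url.query):
        row[f"query.{key}"] = value
    # Each header becomes its own column, keyed by lowercase header name.
    for line in header_lines:
        name, _, value = line.partition(":")
        row[f"header.{name.strip().lower()}"] = value.strip()
    return row


def to_document(raw: str) -> str:
    """Document view: the packet as one whitespace-normalized string."""
    return " ".join(raw.split())


row = to_tabular(RAW_REQUEST)
doc = to_document(RAW_REQUEST)
```

Under these assumptions, `row` would feed a tabular model (column by column), while `doc` would be tokenized directly by a document classifier; the embedding and classification steps themselves are omitted here.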
About the journal:
Journal of Information Security and Applications (JISA) focuses on original research and practice-driven applications with relevance to information security and applications. JISA provides a common linkage between a vibrant scientific and research community and industry professionals by offering a clear view of modern problems and challenges in information security, as well as identifying promising scientific and "best-practice" solutions. JISA issues offer a balance between original research work and innovative industrial approaches by internationally renowned information security experts and researchers.