Learning to detect PII: Tabular vs. Document classification models for network traffic analysis

Impact factor 3.7, Zone 2 (Computer Science), JCR Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Rishika Kohli, Shaifu Gupta, Manoj Singh Gaur, Soma S. Dhavala
Journal of Information Security and Applications, Volume 94, Article 104196
Published: 2025-08-25
DOI: 10.1016/j.jisa.2025.104196
Source: https://www.sciencedirect.com/science/article/pii/S2214212625002339
Citations: 0

Abstract

Detecting Personally Identifiable Information (PII) exfiltration from mobile network traffic is critical for preserving user privacy. Traditional approaches rely on machine learning classifiers trained on manually engineered features extracted from network packets. Deep learning offers the potential to remove the reliance on such an external feature selection process; however, its effectiveness depends significantly on how the underlying packets are encoded. In this work, we investigate deep learning paradigms for PII detection with a focus on the impact of feature encoding strategies. We explore tabular modeling approaches, including both an existing architecture (FT-Transformer) and proposed modular frameworks that integrate a pretrained language model (all-MiniLM-L6-v2) for semantic feature embeddings, followed by a classifier. We also evaluate document classification modeling by applying pretrained language models such as TinyBERT directly to the raw packet content. We further demonstrate the feasibility of on-device inference by deploying trained models using ONNX and TensorFlow Lite. Finally, we recommend modeling strategies based on data size, performance, resource utilization, and generalizability, enabling model selection according to the primary requirement of the deployment scenario.
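The abstract contrasts two ways of encoding a packet but does not say how packets are serialized, so the following is only a minimal sketch of the distinction, not the authors' pipeline. The packet fields, the `toy_embed` placeholder (a hash-based stand-in for a sentence encoder such as all-MiniLM-L6-v2), and both function names are assumptions introduced for illustration.

```python
import hashlib
from typing import Callable, Dict, List

Packet = Dict[str, str]  # hypothetical flattened view of one HTTP request
Embedder = Callable[[str], List[float]]


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Placeholder embedder: deterministic hash-based vector with values in [0, 1].

    Stands in for a real sentence encoder; it carries no semantic meaning and
    only keeps this sketch free of external dependencies.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def tabular_encoding(packet: Packet, fields: List[str], embed: Embedder) -> List[float]:
    """Tabular view: embed each field separately, then concatenate.

    Missing fields are encoded with an empty value so the vector length is fixed.
    """
    vector: List[float] = []
    for name in fields:
        vector.extend(embed(f"{name}: {packet.get(name, '')}"))
    return vector


def document_encoding(packet: Packet) -> str:
    """Document view: serialize the whole packet into one text that a
    sequence classifier (for example a TinyBERT-style model) would consume."""
    return "\n".join(f"{name}: {value}" for name, value in packet.items())


# Hypothetical example packet; the field names are illustrative only.
packet = {
    "method": "POST",
    "host": "ads.example.com",
    "path": "/track",
    "body": "email=user@example.com&lat=28.6",
}
fields = ["method", "host", "path", "body", "user_agent"]

features = tabular_encoding(packet, fields, toy_embed)  # input to a downstream classifier
document = document_encoding(packet)                    # input to a text classifier
print(len(features))  # 5 fields x 8 dimensions = 40
print(document)
```

In the tabular view the classifier sees a fixed-length vector with one slot per feature, whereas the document view leaves it to the language model to find the relevant spans in the raw text. In practice the placeholder embedder would be swapped for a real pretrained encoder.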
Source journal
Journal of Information Security and Applications (Computer Science: Computer Networks and Communications)
CiteScore: 10.90
Self-citation rate: 5.40%
Articles published: 206
Review time: 56 days
About the journal: Journal of Information Security and Applications (JISA) focuses on original research and practice-driven applications with relevance to information security and applications. JISA provides a common linkage between a vibrant scientific and research community and industry professionals by offering a clear view on modern problems and challenges in information security, as well as identifying promising scientific and "best-practice" solutions. JISA issues offer a balance between original research work and innovative industrial approaches by internationally renowned information security experts and researchers.