Document Sensitivity Classification for Data Leakage Prevention with Twitter-Based Document Embedding and Query Expansion

Lap Q. Trieu, Trung-Nguyen Tran, Mai-Khiem Tran, Minh-Triet Tran
Published in: 2017 13th International Conference on Computational Intelligence and Security (CIS), 2017-12-01
DOI: 10.1109/CIS.2017.00125 (https://doi.org/10.1109/CIS.2017.00125)
Citations: 12

Abstract

Document Sensitivity Classification for Data Leakage Prevention with Twitter-Based Document Embedding and Query Expansion
Document sensitivity classification is essential to prevent potential leakage of sensitive data from individuals and organizations. Because most existing methods use regular expressions or data fingerprinting to classify sensitive documents, they may not fully exploit the semantics and content of a document, especially for informal messages and files. This motivates the authors to propose a novel method that classifies document sensitivity in real time with better semantic and content analysis. Taking advantage of deep learning in natural language processing, we use our pre-trained Twitter-based document embedding, TD2V, to encode a document or a text fragment into a fixed-length vector of 300 dimensions. We then use retrieval and automatic query expansion to obtain a re-ranked list of semantically similar known documents, and determine the sensitivity score of a new document from the scores of the retrieved documents in this list. Experimental results show that our method achieves classification accuracy of more than 99.9% on 4 datasets (Snowden, Mormon, Dyncorp, TM) and 98.34% on the Enron dataset. Furthermore, our method can identify a sensitive document early, from a short text fragment alone, with accuracy higher than 98.84%.