Lap Q. Trieu, Trung-Nguyen Tran, Mai-Khiem Tran, Minh-Triet Tran
2017 13th International Conference on Computational Intelligence and Security (CIS), published 2017-12-01. DOI: 10.1109/CIS.2017.00125
Document Sensitivity Classification for Data Leakage Prevention with Twitter-Based Document Embedding and Query Expansion
Document sensitivity classification is essential for preventing leakage of sensitive data from individuals and organizations. Because most existing methods rely on regular expressions or data fingerprinting to classify sensitive documents, they may not fully exploit the semantics and content of a document, especially for informal messages and files. This motivates us to propose a novel method that classifies document sensitivity in real time with better semantic and content analysis. Taking advantage of deep learning in natural language processing, we use our pre-trained Twitter-based document embedding, TD2V, to encode a document or a text fragment into a fixed-length vector of 300 dimensions. We then use retrieval with automatic query expansion to obtain a re-ranked list of semantically similar known documents, and determine the sensitivity score of a new document from the scores of the retrieved documents in this list. Experimental results show that our method can achieve classification accuracy of more than 99.9% on four datasets (Snowden, Mormon, Dyncorp, TM) and 98.34% on the Enron dataset. Furthermore, our method can identify a sensitive document early, from a short text fragment, with accuracy higher than 98.84%.
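The retrieve, expand, and score pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: TD2V is not reproduced here, so toy 3-dimensional vectors stand in for the 300-dimensional embeddings. The abstract does not specify the expansion or scoring rules, so the sketch assumes average query expansion (averaging the query with its top-k neighbours and querying again) and a similarity-weighted mean of the neighbours' labels. All function names (`cosine`, `top_k`, `expand_query`, `sensitivity_score`) and the value of `k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, corpus, k):
    """Indices of the k corpus vectors most similar to the query."""
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine(query, corpus[i]),
                    reverse=True)
    return ranked[:k]


def expand_query(query, corpus, k):
    """Assumed expansion rule: mean of the query and its top-k neighbours."""
    vectors = [query] + [corpus[i] for i in top_k(query, corpus, k)]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sensitivity_score(query, corpus, labels, k=3):
    """Assumed scoring rule: similarity-weighted mean of neighbour labels.

    labels[i] is 1 for a known sensitive document and 0 otherwise.
    """
    expanded = expand_query(query, corpus, k)
    neighbours = top_k(expanded, corpus, k)  # re-ranked list after expansion
    weights = [max(cosine(expanded, corpus[i]), 0.0) for i in neighbours]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * labels[i] for w, i in zip(weights, neighbours)) / total


# Toy stand-ins for embeddings of known documents (not real TD2V output).
corpus = [
    [0.90, 0.10, 0.00], [0.80, 0.20, 0.10], [0.85, 0.05, 0.10],  # sensitive
    [0.00, 0.90, 0.30], [0.10, 0.80, 0.40], [0.05, 0.95, 0.20],  # non-sensitive
]
labels = [1, 1, 1, 0, 0, 0]

score_a = sensitivity_score([0.7, 0.2, 0.1], corpus, labels)
score_b = sensitivity_score([0.1, 0.9, 0.3], corpus, labels)
print(score_a, score_b)
```

A new document would be flagged as sensitive when its score exceeds a chosen threshold. Because a text fragment is embedded in the same vector space as a full document, the same function could in principle be applied to a partial message for early prediction, although the thresholds and fragment lengths used in the paper are not given in the abstract.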