Detecting Data Semantic: A Data Leakage Prevention Approach

2015 IEEE Trustcom/BigDataSE/ISPA Pub Date : 2015-08-20 DOI:10.1109/Trustcom.2015.464

Sultan Alneyadi, E. Sithirasenan, V. Muthukkumarasamy


Citations: 41

Abstract

Organizations are increasingly deploying data leakage prevention systems (DLPSs). Unlike standard security mechanisms such as firewalls and intrusion detection systems, DLPSs are designated systems that protect data in use, at rest, and in transit. They analyze the content and surrounding context of confidential data to detect and prevent unauthorized access to it. DLPSs that use content analysis depend largely on data fingerprinting, regular expressions, and statistical analysis to detect data leaks. Because data is susceptible to change, data fingerprinting and regular expressions fall short in detecting the semantics of evolved confidential data. Statistical analysis, however, can handle data that is fuzzy in nature or otherwise varies. Thus, DLPSs with statistical analysis capabilities can approximate the presence of data semantics. This paper presents a statistical data leakage prevention (DLP) model that classifies data on the basis of semantics. The study contributes to the DLP field by using statistical analysis to detect evolved confidential data. The approach uses the well-known information retrieval function Term Frequency-Inverse Document Frequency (TF-IDF) to classify documents under certain topics. A Singular Value Decomposition (SVD) matrix was also used to visualize the classification results. The results showed that the proposed statistical DLP approach could correctly classify documents even in cases of extreme modification, and that it achieved high precision and recall.
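The TF-IDF classification idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy corpus, the topic labels, the smoothed IDF formula, and the nearest-centroid cosine-similarity rule are hypothetical choices, and the SVD visualization step is left out. The sketch only shows why term statistics can still assign a heavily reworded document to its topic when an exact fingerprint or a regular expression would no longer match.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(token_lists):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms unseen during training are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(labelled_docs):
    """Build IDF weights and one summed TF-IDF vector (centroid) per topic."""
    token_lists = [tokenize(text) for _, text in labelled_docs]
    idf = fit_idf(token_lists)
    centroids = {}
    for (topic, _), tokens in zip(labelled_docs, token_lists):
        centroid = centroids.setdefault(topic, Counter())
        for term, weight in tfidf_vector(tokens, idf).items():
            centroid[term] += weight
    return idf, centroids


def classify(text, idf, centroids):
    """Return the topic whose centroid is most similar to the text."""
    vec = tfidf_vector(tokenize(text), idf)
    return max(centroids, key=lambda topic: cosine(vec, centroids[topic]))


# Hypothetical confidential documents, labelled by topic.
training_docs = [
    ("finance", "quarterly revenue forecast and budget for the merger"),
    ("finance", "budget revenue and profit margin forecast"),
    ("hr", "employee salary records and performance review"),
    ("hr", "employee performance review and disciplinary records"),
]
idf, centroids = train(training_docs)

# A heavily reworded document that shares no full sentence with the corpus.
modified = "draft notes: the merger changes our profit forecast, revenue may drop"
print(classify(modified, idf, centroids))
```

In this toy setup the reworded text overlaps with the finance documents only through individual terms, which is enough for the similarity score to favour that topic. A real deployment would need a much larger labelled corpus and a tuned decision threshold.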


