Adaptable N-gram classification model for data leakage prevention

2013, 7th International Conference on Signal Processing and Communication Systems (ICSPCS). Pub Date: 2013-12-01. DOI: 10.1109/ICSPCS.2013.6723919

Sultan Alneyadi, E. Sithirasenan, V. Muthukkumarasamy


Citations: 12

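Abstract

Data confidentiality, integrity and availability are the ultimate goals of all information security mechanisms. However, most of these mechanisms do not proactively protect sensitive data; rather, they work under predefined policies and conditions to protect data in general. A few systems, such as anomaly-based intrusion detection systems (IDS), might work independently without much administrative interference, but they are not dedicated to the sensitivity of data. New mechanisms called data leakage prevention systems (DLPs) have been developed to mitigate the risk of sensitive data leakage. Current DLPs mostly use data fingerprinting and exact and partial document matching to classify sensitive data. These approaches can have a serious limitation because they are susceptible to data misidentification. In this paper, we investigate the use of N-gram statistical analysis for data classification. Our method uses N-gram frequencies to classify documents under distinct categories. We use simple taxicab geometry to compute the similarity between documents and existing categories. Moreover, we examine the effect of removing the most common words and connecting phrases on the overall classification. We aim to compensate for the limitations of current data classification approaches used in the field of data leakage prevention. We show that our method is capable of correctly classifying up to 90.5% of the tested documents.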
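The classification idea described in the abstract (N-gram frequencies compared with taxicab geometry, optionally after removing common words) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say whether the N-grams are character-based or word-based, how frequencies are normalized, or which words are removed, so character trigrams, relative frequencies, the `STOP_WORDS` list, and all function names below are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical stop-word list; the abstract does not state which words were removed.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}


def ngram_profile(text, n=3, remove_stop_words=False):
    """Return the relative frequency of each character n-gram in `text`.

    Assumption: character n-grams over lower-cased, whitespace-normalized text.
    """
    words = text.lower().split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    cleaned = " ".join(words)
    grams = [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def taxicab_distance(p, q):
    """Taxicab (L1) distance: sum of absolute frequency differences
    over every n-gram that appears in either profile."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in p.keys() | q.keys())


def classify(document, category_profiles, n=3, remove_stop_words=False):
    """Return the name of the category whose profile is closest to the document."""
    doc_profile = ngram_profile(document, n, remove_stop_words)
    return min(
        category_profiles,
        key=lambda name: taxicab_distance(doc_profile, category_profiles[name]),
    )


if __name__ == "__main__":
    # Toy category profiles built from made-up sample text.
    categories = {
        "finance": ngram_profile("quarterly revenue profit and loss balance sheet budget forecast"),
        "medical": ngram_profile("patient diagnosis treatment prescription clinical blood test"),
    }
    print(classify("the budget forecast shows revenue growth", categories))
```

A smaller taxicab distance means the document's N-gram distribution is closer to the category profile, so the document is assigned to the nearest category. Passing `remove_stop_words=True` gives a rough way to try the variation the abstract mentions, removing common words before building the profiles.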


