{"title":"A novel approach of data sanitization by noise addition and knowledge discovery by clustering","authors":"H. Abdullah, Ahsan Siddiqi, Fuad Bajaber","doi":"10.1109/WSCNIS.2015.7368283","DOIUrl":null,"url":null,"abstract":"Security of published data cannot be less important as compared to unpublished data or the data which is not made public. Therefore, PII (Personally Identifiable Information) is removed and data sanitized when organizations recording large volumes of data publish that data. However, this approach of ensuring data privacy and security can result in loss of utility of that published data for knowledge discovery. Therefore, a balance is required between privacy and the utility needs of published data. In this paper we study this delicate balance by evaluating four data mining clustering techniques for knowledge discovery and propose two privacy/utility quantification parameters. We subsequently perform number of experiments to statistically identify which clustering technique is best suited with desirable level of privacy/utility while noise is incrementally increased by simultaneously degrading data accuracy, completeness and consistency.","PeriodicalId":253256,"journal":{"name":"2015 World Symposium on Computer Networks and Information Security (WSCNIS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 World Symposium on Computer Networks and Information Security (WSCNIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WSCNIS.2015.7368283","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
Security of published data cannot be less important as compared to unpublished data or the data which is not made public. Therefore, PII (Personally Identifiable Information) is removed and data sanitized when organizations recording large volumes of data publish that data. However, this approach of ensuring data privacy and security can result in loss of utility of that published data for knowledge discovery. Therefore, a balance is required between privacy and the utility needs of published data. In this paper we study this delicate balance by evaluating four data mining clustering techniques for knowledge discovery and propose two privacy/utility quantification parameters. We subsequently perform number of experiments to statistically identify which clustering technique is best suited with desirable level of privacy/utility while noise is incrementally increased by simultaneously degrading data accuracy, completeness and consistency.