InfoClean

Journal of Data and Information Quality (JDIQ) Pub Date : 2018-04-12 DOI:10.1145/3190577

Fei Chiang, Dhruv Gairola

{"title":"InfoClean","authors":"Fei Chiang, Dhruv Gairola","doi":"10.1145/3190577","DOIUrl":null,"url":null,"abstract":"Data quality has become a pervasive challenge for organizations as they wrangle with large, heterogeneous datasets to extract value. Given the proliferation of sensitive and confidential information, it is crucial to consider data privacy concerns during the data cleaning process. For example, in medical database applications, varying levels of privacy are enforced across the attribute values. Attributes such as a patient’s country or city of residence may be less sensitive than the patient’s prescribed medication. Traditional data cleaning techniques assume the data is openly accessible, without considering the differing levels of information sensitivity. In this work, we take the first steps toward a data cleaning model that integrates privacy as part of the data cleaning process. We present a privacy-aware data cleaning framework that differentiates the information content among the attribute values during the data cleaning process to resolve data inconsistencies while minimizing the amount of information disclosed. Our data repair algorithm includes a set of data disclosure operations that considers the information content of the underlying attribute values, while maximizing data utility. Our evaluation using real datasets shows that our algorithm scales well, and achieves improved performance and comparable repair accuracy against existing data cleaning solutions.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"29 1","pages":"1 - 26"},"PeriodicalIF":0.0000,"publicationDate":"2018-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"InfoClean\",\"authors\":\"Fei Chiang, Dhruv Gairola\",\"doi\":\"10.1145/3190577\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data quality has become a pervasive challenge for organizations as they wrangle with large, heterogeneous datasets to extract value. Given the proliferation of sensitive and confidential information, it is crucial to consider data privacy concerns during the data cleaning process. For example, in medical database applications, varying levels of privacy are enforced across the attribute values. Attributes such as a patient’s country or city of residence may be less sensitive than the patient’s prescribed medication. Traditional data cleaning techniques assume the data is openly accessible, without considering the differing levels of information sensitivity. In this work, we take the first steps toward a data cleaning model that integrates privacy as part of the data cleaning process. We present a privacy-aware data cleaning framework that differentiates the information content among the attribute values during the data cleaning process to resolve data inconsistencies while minimizing the amount of information disclosed. Our data repair algorithm includes a set of data disclosure operations that considers the information content of the underlying attribute values, while maximizing data utility. Our evaluation using real datasets shows that our algorithm scales well, and achieves improved performance and comparable repair accuracy against existing data cleaning solutions.\",\"PeriodicalId\":15582,\"journal\":{\"name\":\"Journal of Data and Information Quality (JDIQ)\",\"volume\":\"29 1\",\"pages\":\"1 - 26\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-04-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Data and Information Quality (JDIQ)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3190577\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Data and Information Quality (JDIQ)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3190577","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 13

摘要

数据质量已经成为组织普遍面临的挑战，因为他们与大型异构数据集争论以提取价值。鉴于敏感和机密信息的激增，在数据清理过程中考虑数据隐私问题至关重要。例如，在医疗数据库应用程序中，在属性值之间强制执行不同级别的隐私。患者居住的国家或城市等属性可能不如患者的处方药物敏感。传统的数据清理技术假定数据是公开可访问的，而不考虑不同级别的信息敏感性。在这项工作中，我们向数据清理模型迈出了第一步，该模型将隐私集成为数据清理过程的一部分。我们提出了一个隐私敏感的数据清理框架，该框架在数据清理过程中区分属性值之间的信息内容，以解决数据不一致的问题，同时最大限度地减少信息泄露。我们的数据修复算法包括一组数据披露操作，这些操作考虑了底层属性值的信息内容，同时最大化了数据效用。我们使用真实数据集进行的评估表明，我们的算法可扩展性很好，并且与现有的数据清理解决方案相比，实现了改进的性能和相当的修复精度。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

InfoClean

Data quality has become a pervasive challenge for organizations as they wrangle with large, heterogeneous datasets to extract value. Given the proliferation of sensitive and confidential information, it is crucial to consider data privacy concerns during the data cleaning process. For example, in medical database applications, varying levels of privacy are enforced across the attribute values. Attributes such as a patient’s country or city of residence may be less sensitive than the patient’s prescribed medication. Traditional data cleaning techniques assume the data is openly accessible, without considering the differing levels of information sensitivity. In this work, we take the first steps toward a data cleaning model that integrates privacy as part of the data cleaning process. We present a privacy-aware data cleaning framework that differentiates the information content among the attribute values during the data cleaning process to resolve data inconsistencies while minimizing the amount of information disclosed. Our data repair algorithm includes a set of data disclosure operations that considers the information content of the underlying attribute values, while maximizing data utility. Our evaluation using real datasets shows that our algorithm scales well, and achieves improved performance and comparable repair accuracy against existing data cleaning solutions.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Data and Information Quality (JDIQ)

自引率

0.00%

发文量