A Data-Centric $\ell$-Diversity Model for Securely Publishing Personal Data With Enhanced Utility

IF 5.7 | Zone 3, Computer Science | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data, Pub Date: 2025-01-01, DOI: 10.1109/TBDATA.2024.3524832

Abdul Majeed; Seong Oun Hwang

Vol. 11, No. 5, pp. 2278-2295. Source: https://ieeexplore.ieee.org/document/10819610/

Citations: 0

Abstract



A Data-Centric $\ell$-Diversity Model for Securely Publishing Personal Data With Enhanced Utility

In this paper, we propose and implement a novel anonymization model, called data-centric $\ell$-diversity, to effectively safeguard the privacy of individuals with considerably enhanced utility in data publishing scenarios. Through experimental analysis of real-life datasets, we found that when the data quality is poor (e.g., distributions are uneven), most of the existing methods only anonymize some parts of the data (where distributions are balanced) and leave other parts unprocessed, which can lead to explicit privacy disclosures. Furthermore, they do not identify and repair problematic parts of the data before anonymization, and therefore, they are not secure from the threat of privacy breaches. To address these technical problems, in this paper, we implement an automated method that identifies vulnerabilities in the underlying data to be anonymized w.r.t. distribution, and that repairs them by injecting virtual samples of good quality. Later, we implement a data partitioning strategy that creates compact and diverse classes of size $k$, where $k$ is the privacy parameter. Finally, only shallow generalization (or no generalization) is applied to each class to minimally generalize the data, whereas existing methods overly distort data by not improving the quality beforehand, which can lead to poor utility in data-driven services. We conducted detailed experiments on four datasets to justify the performance of our model in realistic scenarios, and achieved promising results from the perspectives of boosted accuracy, privacy preservation, data utility enrichment, and reduced computing overheads. Compared with baseline methods, our model enhanced privacy preservation by 36.56% on three different metrics, and data utility was augmented with 18.65% less information loss and 14.37% greater accuracy. Lastly, our model, on average, has shown a 26.13% reduction in time overheads compared to the SOTA baseline methods.
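To make the privacy notions in the abstract concrete, here is a minimal Python sketch. It is not the authors' algorithm; it only assumes the standard distinct $\ell$-diversity definition (every class of records sharing the same quasi-identifier values has at least $k$ members and at least $\ell$ distinct sensitive values) and adds a simple check for sensitive values that are rare in the overall distribution. All function names, the `min_share` threshold, and the toy records are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict


def group_by_quasi_identifiers(records, qi_keys):
    """Group records into classes keyed by their quasi-identifier values."""
    classes = defaultdict(list)
    for rec in records:
        classes[tuple(rec[key] for key in qi_keys)].append(rec)
    return dict(classes)


def is_k_anonymous_l_diverse(classes, sensitive_key, k, l):
    """True if every class has >= k records and >= l distinct sensitive values."""
    for members in classes.values():
        distinct = {rec[sensitive_key] for rec in members}
        if len(members) < k or len(distinct) < l:
            return False
    return True


def underrepresented_values(records, sensitive_key, min_share):
    """Sensitive values whose share of all records is below min_share."""
    counts = Counter(rec[sensitive_key] for rec in records)
    total = sum(counts.values())
    return sorted(v for v, c in counts.items() if c / total < min_share)


# Toy, already-generalized records (illustrative only, not from the paper).
records = [
    {"age": "30-39", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "130**", "disease": "cancer"},
    {"age": "30-39", "zip": "130**", "disease": "flu"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
]

classes = group_by_quasi_identifiers(records, ["age", "zip"])

# False: the second class contains only one distinct disease.
print(is_k_anonymous_l_diverse(classes, "disease", k=3, l=2))

# ['cancer']: it accounts for 1 of 6 records, below the 20% threshold.
print(underrepresented_values(records, "disease", min_share=0.2))
```

The second check hints at why an uneven distribution is a problem: when one sensitive value dominates, some classes cannot reach $\ell$ distinct values no matter how the records are partitioned, which is the kind of situation the abstract says is repaired before anonymization.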


Source journal

IEEE Transactions on Big Data Multiple-

CiteScore: 11.80

Self-citation rate: 2.80%

Articles published: 114

Journal description: The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.