Authors: Abdul Majeed; Seong Oun Hwang
Journal: IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2278-2295
Publication date: 2025-01-01
Publication type: Journal Article
DOI: 10.1109/TBDATA.2024.3524832
URL: https://ieeexplore.ieee.org/document/10819610/
Impact factor: 5.7; JCR: Q1 (Computer Science, Information Systems)
A Data-Centric $\ell$-Diversity Model for Securely Publishing Personal Data With Enhanced Utility
In this paper, we propose and implement a novel anonymization model, called data-centric $\ell$-diversity, that safeguards the privacy of individuals while considerably enhancing utility in data publishing scenarios. Through experimental analysis of real-life datasets, we found that when data quality is poor (e.g., distributions are uneven), most existing methods anonymize only some parts of the data (where distributions are balanced) and leave other parts unprocessed, which can lead to explicit privacy disclosures. Furthermore, they do not identify and repair problematic parts of the data before anonymization and are therefore not secure against privacy breaches. To address these problems, we implement an automated method that identifies distribution-related vulnerabilities in the data to be anonymized and repairs them by injecting good-quality virtual samples. Next, we implement a data partitioning strategy that creates compact and diverse classes of size $k$, where $k$ is the privacy parameter. Finally, only shallow generalization (or no generalization) is applied to each class so that the data are minimally generalized, whereas existing methods overly distort data because they do not improve its quality beforehand, which can lead to poor utility in data-driven services. We conducted detailed experiments on four datasets to evaluate our model in realistic scenarios and achieved promising results in terms of accuracy, privacy preservation, data utility, and computing overhead. Compared with baseline methods, our model enhanced privacy preservation by 36.56% on three different metrics and improved data utility, with 18.65% less information loss and 14.37% greater accuracy. Lastly, our model showed, on average, a 26.13% reduction in time overhead compared with the state-of-the-art (SOTA) baseline methods.
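The abstract's observation that uneven distributions leave some classes unprotected can be illustrated with a small sketch. This is not the authors' algorithm, whose details the abstract does not give. It assumes records are `(age, diagnosis)` tuples, uses a plain sort-and-cut partition into classes of size $k$, takes "distinct $\ell$-diversity" (at least $\ell$ distinct sensitive values per class) as the diversity criterion, and uses an age range as the shallow generalization. All function names and the sample data are hypothetical.

```python
def partition_into_classes(records, k, qi_key):
    """Sort by a quasi-identifier and cut into consecutive classes of size k.

    Assumption: leftover records (fewer than k) are merged into the last class.
    """
    ordered = sorted(records, key=qi_key)
    classes = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    if len(classes) > 1 and len(classes[-1]) < k:
        leftover = classes.pop()
        classes[-1].extend(leftover)
    return classes


def violating_classes(classes, ell, sensitive_key):
    """Return indices of classes with fewer than ell distinct sensitive values."""
    return [
        i for i, cls in enumerate(classes)
        if len({sensitive_key(r) for r in cls}) < ell
    ]


def age_range(cls):
    """Shallow generalization: replace exact ages by the class's (min, max) range."""
    ages = [r[0] for r in cls]
    return (min(ages), max(ages))


# Hypothetical toy data with an uneven distribution: "flu" dominates.
records = [
    (23, "flu"), (25, "flu"), (31, "cancer"), (35, "flu"),
    (42, "hiv"), (47, "cancer"), (52, "flu"),
]

classes = partition_into_classes(records, k=2, qi_key=lambda r: r[0])
bad = violating_classes(classes, ell=2, sensitive_key=lambda r: r[1])

print([len(c) for c in classes])  # [2, 2, 3]
print(bad)                        # [0]: the youngest class holds only "flu"
print(age_range(classes[0]))      # (23, 25)
```

In this toy run the first class contains only one sensitive value, so publishing it would disclose the diagnosis of anyone known to be aged 23 to 25. A repair step such as the one the abstract describes would have to address classes like this before generalization.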
About the journal:
The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.