Non-interactive K-mode clustering of high-dimensional categorical data under local differential privacy

IF 6.8 | Region 1, Computer Science | JCR category: COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Sciences Pub Date : 2025-06-11 DOI:10.1016/j.ins.2025.122417

Xinxin Ye, Youwen Zhu, Shunsheng Zhang, Hai Deng, Pengfei Yu

Information Sciences, volume 718, Article 122417. Full text: https://www.sciencedirect.com/science/article/pii/S0020025525005493

Citations: 0

Abstract

Non-interactive K-mode clustering of high-dimensional categorical data under local differential privacy

High-dimensional categorical data contains rich user-sensitive information, which poses serious privacy threats to users if it is leaked during data clustering and analysis. Existing K-mode methods under local differential privacy (LDP) always require multiple user-server interactions, which not only incurs high communication overhead and computational cost but also leaves user privacy vulnerable to malicious attacks during those interactions. In this paper, we propose a non-interactive LDP K-mode clustering estimation method for high-dimensional categorical data. We first reduce the dimensionality of each user's data locally with the Fsketch algorithm. We then perturb the sketch data in a way that satisfies LDP and submit the perturbed data to the server for K-mode clustering. Finally, on the server, we estimate the Hamming distance directly on the perturbed data to carry out K-mode clustering analysis. We prove theoretically that our perturbation method satisfies LDP and is unbiased. Compared with state-of-the-art methods, our scheme estimates the Hamming distance between high-dimensional categorical data more accurately and reduces communication overhead because only one interaction is required. Extensive experiments on four real high-dimensional categorical data sets demonstrate the effectiveness of our method: under the same privacy budget, our scheme achieves a smaller normalized intra-cluster variance and a larger purity index.
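
The abstract does not give the perturbation mechanism or the estimator, so the following is only a minimal sketch of the general idea (perturb once on the client, then debias the Hamming distance on the server). It is not the authors' method: it swaps in generalized randomized response as a stand-in perturbation, skips the Fsketch dimensionality-reduction step entirely, and treats `epsilon` as a per-attribute budget. The function names `grr_perturb`, `perturb_record`, and `estimate_hamming`, and the assumption that every attribute shares a domain of size `k`, are illustrative choices.

```python
import math
import random


def grr_perturb(value, k, epsilon, rng):
    """Generalized randomized response over the domain {0, ..., k-1} (k >= 2).

    Keeps the true value with probability p = e^eps / (e^eps + k - 1),
    otherwise reports one of the other k-1 values uniformly at random.
    """
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value
    other = rng.randrange(k - 1)  # index among the k-1 remaining values
    return other if other < value else other + 1


def perturb_record(record, k, epsilon, rng):
    """Client side: perturb every attribute independently, in one round."""
    return [grr_perturb(v, k, epsilon, rng) for v in record]


def estimate_hamming(x_noisy, y_noisy, k, epsilon):
    """Server side: debiased Hamming distance between two perturbed records.

    With q = 1 / (e^eps + k - 1), two reports of the same true value disagree
    with probability 1 - (p^2 + (k-1) q^2), and the expected number of
    observed mismatches is d * (1 - a) + (p - q)^2 * H, where
    a = p^2 + (k-1) q^2 and H is the true Hamming distance.
    Solving for H gives the estimator below.
    """
    d = len(x_noisy)
    e = math.exp(epsilon)
    p = e / (e + k - 1)
    q = 1.0 / (e + k - 1)
    agree_if_equal = p * p + (k - 1) * q * q
    observed = sum(1 for a, b in zip(x_noisy, y_noisy) if a != b)
    return (observed - (1.0 - agree_if_equal) * d) / (p - q) ** 2


if __name__ == "__main__":
    rng = random.Random(42)
    x = [rng.randrange(4) for _ in range(100)]
    y = [rng.randrange(4) for _ in range(100)]
    true_h = sum(1 for a, b in zip(x, y) if a != b)
    est = estimate_hamming(
        perturb_record(x, 4, 2.0, rng), perturb_record(y, 4, 2.0, rng), 4, 2.0
    )
    print("true:", true_h, "estimate:", round(est, 2))
```

A single estimate can be noisy, and can even fall outside the range 0 to d; the estimator is only correct in expectation over the perturbation. Because each attribute is perturbed independently with `epsilon`, the whole record in this sketch consumes a budget that grows with the number of attributes, which is one reason a dimensionality-reduction step before perturbation is attractive.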

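Source journal

Information Sciences (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 14.00

Self-citation rate: 17.30%

Articles published: 1322

Review time: 10.4 months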

About the journal: Informatics and Computer Science Intelligent Systems Applications is an esteemed international journal that focuses on publishing original and creative research findings in the field of information sciences. We also feature a limited number of timely tutorial and surveying contributions. Our journal aims to cater to a diverse audience, including researchers, developers, managers, strategic planners, graduate students, and anyone interested in staying up-to-date with cutting-edge research in information science, knowledge engineering, and intelligent systems. While readers are expected to share a common interest in information science, they come from varying backgrounds such as engineering, mathematics, statistics, physics, computer science, cell biology, molecular biology, management science, cognitive science, neurobiology, behavioral sciences, and biochemistry.