Xinxin Ye, Youwen Zhu, Shunsheng Zhang, Hai Deng, Pengfei Yu
Information Sciences, Volume 718, Article 122417. Published 2025-06-11. DOI: 10.1016/j.ins.2025.122417. URL: https://www.sciencedirect.com/science/article/pii/S0020025525005493
Non-interactive K-mode clustering of high-dimensional categorical data under local differential privacy
High-dimensional categorical data contains rich, sensitive user information, and leaking it during clustering and analysis poses a serious privacy threat to users. Existing K-mode methods under local differential privacy (LDP) require multiple user-server interactions, which incur high communication and computational costs and leave user privacy vulnerable to malicious attacks during those interactions. In this paper, we propose a non-interactive LDP K-mode clustering estimation method for high-dimensional categorical data. We first reduce the dimensionality of each user's data locally with the Fsketch algorithm. We then perturb the sketch data in a way that satisfies LDP and submit the perturbed data to the server. Finally, the server estimates the Hamming distance directly on the perturbed data to carry out K-mode clustering. We prove theoretically that our perturbation method satisfies LDP and is unbiased. Compared with state-of-the-art methods, our scheme estimates the Hamming distance between high-dimensional categorical records more accurately and reduces communication overhead because it needs only one interaction. Extensive experiments on four real high-dimensional categorical data sets demonstrate the effectiveness of our method: under the same privacy budget, our scheme achieves a smaller normalized intra-cluster variance and a larger purity index.
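The idea of estimating Hamming distance directly on perturbed data can be illustrated with a minimal sketch. This is not the paper's method: it skips the Fsketch step entirely and swaps in generalized randomized response (k-RR) applied to each raw attribute, with a per-attribute budget `eps`. All function names, and the assumption that every attribute shares one domain size `k`, are hypothetical choices made for illustration. Under k-RR a value is kept with probability p = e^eps / (e^eps + k - 1) and replaced by each other value with probability q = 1 / (e^eps + k - 1), so the expected noisy distance is a linear function of the true distance and can be inverted.

```python
import math
import random


def krr_probs(k, eps):
    """Return (p, q): keep probability and per-other-value probability of k-RR."""
    e = math.exp(eps)
    return e / (e + k - 1), 1.0 / (e + k - 1)


def krr_perturb(record, k, eps, rng):
    """Perturb each attribute (values in 0..k-1) independently on the user side."""
    p, _ = krr_probs(k, eps)
    out = []
    for v in record:
        if rng.random() < p:
            out.append(v)  # report the true value
        else:
            r = rng.randrange(k - 1)  # uniform over the k-1 other values
            out.append(r if r < v else r + 1)
    return out


def hamming(a, b):
    """Number of positions where two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def expected_noisy_hamming(h, m, k, eps):
    """Expected Hamming distance between two perturbed records of length m
    whose true distance is h (both records perturbed independently)."""
    p, q = krr_probs(k, eps)
    p_same = p * p + (k - 1) * q * q      # reports match, true values equal
    p_diff = 2 * p * q + (k - 2) * q * q  # reports match, true values differ
    return (m - h) * (1 - p_same) + h * (1 - p_diff)


def debias_hamming(noisy_h, m, k, eps):
    """Server-side unbiased estimate of the true distance from the noisy one.
    Uses p_same - p_diff = (p - q)^2 to invert the linear relation above."""
    p, q = krr_probs(k, eps)
    p_same = p * p + (k - 1) * q * q
    return (noisy_h - m * (1 - p_same)) / ((p - q) ** 2)
```

Each user would call `krr_perturb` once and send the result, and the server would call `debias_hamming` on pairs of reports (or on a report and a candidate mode). Note that perturbing m attributes with budget `eps` each costs m * `eps` in total under sequential composition, which is one reason reducing dimensionality before perturbation is attractive for high-dimensional data.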
About the journal:
Information Sciences (Informatics and Computer Science, Intelligent Systems, Applications) is an international journal that publishes original and creative research findings in the field of information sciences. It also features a limited number of timely tutorial and survey contributions.
Our journal aims to cater to a diverse audience, including researchers, developers, managers, strategic planners, graduate students, and anyone interested in staying up-to-date with cutting-edge research in information science, knowledge engineering, and intelligent systems. While readers are expected to share a common interest in information science, they come from varying backgrounds such as engineering, mathematics, statistics, physics, computer science, cell biology, molecular biology, management science, cognitive science, neurobiology, behavioral sciences, and biochemistry.