Privacy-enhanced data distillation with probability distribution matching
Ke Pan, Yuxin Wen, Yiming Wang, Maoguo Gong, Hui Li, Shanfeng Wang
Neurocomputing, Volume 652, Article 130953. Published 2025-07-19.
DOI: 10.1016/j.neucom.2025.130953
URL: https://www.sciencedirect.com/science/article/pii/S092523122501625X
Citations: 0
Abstract
Data distillation aims to condense a large-scale original dataset into a small-scale synthetic dataset while preserving as much data utility as possible. As one of the typical implementation mechanisms of data distillation, distribution matching works by aligning the feature distributions of synthetic and original samples, avoiding the expensive computation and memory costs associated with other matching mechanisms. However, distribution matching still has two primary limitations. On the one hand, it suffers from inadequate class discrimination: synthetic samples within the same class may be misclassified as other classes because of their scattered feature distribution. On the other hand, it raises serious privacy concerns, as the synthetic dataset may inadvertently contain sensitive information extracted from the original dataset. Motivated by these observations, we propose a novel privacy-enhanced distribution matching-based data distillation algorithm. First, we design a probability distribution matching method with an intra-class aggregation constraint and an inter-class dispersion constraint, based on symmetric Kullback-Leibler divergence, to strengthen the performance of data distillation. Second, we design a dynamic noise perturbation method based on differential privacy to enhance data privacy guarantees while preserving high sample quality. Extensive experiments demonstrate that, compared to state-of-the-art methods, our algorithm achieves performance improvements of up to 4.5% on the CIFAR10 dataset and 2.7% on the SVHN dataset.
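To make the two mechanisms named in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation: a symmetric-KL probability-distribution-matching loss with an intra-class aggregation term and an inter-class dispersion term, plus a Gaussian-mechanism-style noise perturbation in the spirit of differential privacy. All names, the unit term weights, the hinge-margin form of the dispersion term, and the clip/sigma parameters are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def sym_kl(p, q, eps=1e-8):
    """Symmetric KL divergence between probability rows p and q."""
    p, q = p.clamp_min(eps), q.clamp_min(eps)
    return (p * (p / q).log()).sum(-1) + (q * (q / p).log()).sum(-1)

def pdm_loss(real_logits, real_labels, syn_logits, syn_labels, margin=1.0):
    """Probability distribution matching with intra-class aggregation and
    inter-class dispersion terms (all term weights set to 1 here).
    Assumes every class appears in both the real and synthetic batches."""
    p_real = F.softmax(real_logits, dim=-1)
    p_syn = F.softmax(syn_logits, dim=-1)
    match, intra, centers = 0.0, 0.0, []
    for c in syn_labels.unique():
        mu_r = p_real[real_labels == c].mean(dim=0)  # real class-mean distribution
        rows = p_syn[syn_labels == c]
        mu_s = rows.mean(dim=0)                      # synthetic class-mean distribution
        match = match + sym_kl(mu_r, mu_s)           # align the two class distributions
        # Intra-class aggregation: pull each synthetic sample toward its class mean.
        intra = intra + sym_kl(rows, mu_s.expand_as(rows)).mean()
        centers.append(mu_s)
    centers = torch.stack(centers)
    # Inter-class dispersion: a hinge pushing class means at least `margin` apart.
    dist = torch.cdist(centers, centers)
    off_diag = dist[~torch.eye(len(centers), dtype=torch.bool)]
    inter = F.relu(margin - off_diag).mean()
    return match + intra + inter

def dp_noise(x, clip=1.0, sigma=0.5):
    """Gaussian-mechanism-style perturbation: clip each sample's L2 norm to
    `clip`, then add Gaussian noise scaled by `sigma * clip`."""
    norms = x.flatten(1).norm(dim=1, keepdim=True).clamp_min(1e-8)
    scale = (clip / norms).clamp(max=1.0).view(-1, *([1] * (x.dim() - 1)))
    return x * scale + sigma * clip * torch.randn_like(x)
```

In a distribution-matching pipeline, the synthetic samples would be optimized by backpropagating `pdm_loss` through a feature extractor; the paper's "dynamic" perturbation presumably varies the noise scale over the course of distillation, whereas `dp_noise` above keeps `sigma` fixed for simplicity.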
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice, and applications are the essential topics covered.