Privacy-enhanced data distillation with probability distribution matching

IF 5.5 | Zone 2 (Computer Science) | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Ke Pan, Yuxin Wen, Yiming Wang, Maoguo Gong, Hui Li, Shanfeng Wang
Journal: Neurocomputing, volume 652, Article 130953
Publication date: 2025-07-19
Publication type: Journal Article
DOI: 10.1016/j.neucom.2025.130953
URL: https://www.sciencedirect.com/science/article/pii/S092523122501625X
Citations: 0

Abstract

Data distillation condenses a large-scale original dataset into a small-scale synthetic dataset while preserving as much data utility as possible. Distribution matching, one of its typical implementation mechanisms, aligns the feature distributions of synthetic and original samples and avoids the expensive computation and memory costs associated with other matching mechanisms. However, distribution matching still has two primary limitations. One is inadequate class discrimination: because the feature distribution is scattered, synthetic samples of one class may be misclassified as other classes. The other is a serious privacy concern: the synthetic dataset may inadvertently contain sensitive information extracted from the original dataset. To address both limitations, we propose a novel privacy-enhanced, distribution matching-based data distillation algorithm. First, we design a probability distribution matching method, based on symmetric Kullback-Leibler divergence, with an intra-class aggregation constraint and an inter-class dispersion constraint to strengthen the performance of data distillation. Second, we design a dynamic noise perturbation method based on differential privacy to enhance data privacy guarantees while preserving higher sample quality. Extensive experiments demonstrate that our algorithm achieves performance improvements of up to 4.5% on the CIFAR10 dataset and 2.7% on the SVHN dataset compared to state-of-the-art methods.
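The abstract names symmetric Kullback-Leibler divergence as the basis of the matching constraints but does not give the objective itself. The following is a minimal sketch of what such a measure could look like, not the authors' formulation. The symmetric KL definition, KL(p || q) + KL(q || p), is standard; everything else is an assumption made for illustration: the helper names, the use of class-mean probability vectors, the way the two terms are combined, and the weight `lam`.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Symmetric KL divergence: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


def mean_distribution(dists):
    """Element-wise mean of a non-empty list of probability vectors."""
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]


def matching_loss(syn, real, lam=0.1):
    """Hypothetical objective (an assumption, not taken from the paper).

    syn, real: dict mapping class label -> list of probability vectors.
    The first term pulls each synthetic class mean toward the real class
    mean; the second term rewards synthetic class means that lie far apart.
    """
    syn_means = {c: mean_distribution(d) for c, d in syn.items()}
    real_means = {c: mean_distribution(d) for c, d in real.items()}

    # Aggregation-style term: synthetic vs. real, per class.
    intra = sum(symmetric_kl(syn_means[c], real_means[c]) for c in syn_means)

    # Dispersion-style term: every unordered pair of synthetic classes.
    classes = sorted(syn_means)
    inter = sum(
        symmetric_kl(syn_means[a], syn_means[b])
        for i, a in enumerate(classes)
        for b in classes[i + 1:]
    )
    return intra - lam * inter
```

In this sketch a lower value means synthetic classes sit closer to their real counterparts and farther from each other. Symmetrizing the divergence makes the penalty independent of which distribution is treated as the reference, which plain KL does not guarantee.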
Source journal
Neurocomputing (Engineering and Technology: Computer Science, Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days
About the journal: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.