Dataset Distillation: A Comprehensive Review

IF 20.8 · CAS Tier 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Ruonan Yu, Songhua Liu, Xinchao Wang
{"title":"数据集蒸馏:综述","authors":"Ruonan Yu, Songhua Liu, Xinchao Wang","doi":"10.48550/arXiv.2301.07014","DOIUrl":null,"url":null,"abstract":"Recent success of deep learning is largely attributed to the sheer amount of data used for training deep neural networks. Despite the unprecedented success, the massive data, unfortunately, significantly increases the burden on storage and transmission and further gives rise to a cumbersome model training process. Besides, relying on the raw data for training per se yields concerns about privacy and copyright. To alleviate these shortcomings, dataset distillation (DD), also known as dataset condensation (DC), was introduced and has recently attracted much research attention in the community. Given an original dataset, DD aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset. In this paper, we give a comprehensive review and summary of recent advances in DD and its application. We first introduce the task formally and propose an overall algorithmic framework followed by all existing DD methods. Next, we provide a systematic taxonomy of current methodologies in this area, and discuss their theoretical interconnections. We also present current challenges in DD through extensive empirical studies and envision possible directions for future works.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":" ","pages":""},"PeriodicalIF":20.8000,"publicationDate":"2023-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":"{\"title\":\"Dataset Distillation: A Comprehensive Review\",\"authors\":\"Ruonan Yu, Songhua Liu, Xinchao Wang\",\"doi\":\"10.48550/arXiv.2301.07014\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent success of deep learning is largely attributed to the sheer amount of data used for training deep neural networks. Despite the unprecedented success, the massive data, unfortunately, significantly increases the burden on storage and transmission and further gives rise to a cumbersome model training process. Besides, relying on the raw data for training per se yields concerns about privacy and copyright. To alleviate these shortcomings, dataset distillation (DD), also known as dataset condensation (DC), was introduced and has recently attracted much research attention in the community. Given an original dataset, DD aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset. In this paper, we give a comprehensive review and summary of recent advances in DD and its application. We first introduce the task formally and propose an overall algorithmic framework followed by all existing DD methods. Next, we provide a systematic taxonomy of current methodologies in this area, and discuss their theoretical interconnections. 
We also present current challenges in DD through extensive empirical studies and envision possible directions for future works.\",\"PeriodicalId\":13426,\"journal\":{\"name\":\"IEEE Transactions on Pattern Analysis and Machine Intelligence\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":20.8000,\"publicationDate\":\"2023-01-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"26\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Pattern Analysis and Machine Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2301.07014\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Pattern Analysis and Machine Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.48550/arXiv.2301.07014","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 26

Abstract

The recent success of deep learning is largely attributed to the sheer amount of data used for training deep neural networks. Despite this unprecedented success, the massive data unfortunately place a significant burden on storage and transmission and make the model training process cumbersome. Moreover, relying on the raw data for training raises concerns about privacy and copyright. To alleviate these shortcomings, dataset distillation (DD), also known as dataset condensation (DC), was introduced and has recently attracted much research attention in the community. Given an original dataset, DD aims to derive a much smaller dataset of synthetic samples, such that models trained on it yield performance comparable to models trained on the original dataset. In this paper, we give a comprehensive review and summary of recent advances in DD and its applications. We first introduce the task formally and propose an overall algorithmic framework that all existing DD methods follow. Next, we provide a systematic taxonomy of current methodologies in this area and discuss their theoretical interconnections. We also present current challenges in DD through extensive empirical studies and envision possible directions for future work.
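The task the abstract describes is a bi-level optimization: find a small synthetic set S minimizing the loss on the original set T of a model trained on S, i.e., minimize L_T(theta_S) subject to theta_S = argmin_theta L_S(theta). The PyTorch sketch below makes this concrete in the unrolled meta-learning style of early DD work; the toy linear model, random stand-in data, helper names, and all hyperparameters are illustrative assumptions, not the survey's reference implementation.

```python
# A minimal sketch of the bi-level dataset-distillation loop: learn synthetic
# data S so that a model trained on S performs well on the real data T.
# The linear model, stand-in data, and hyperparameters are illustrative only.
import torch
import torch.nn.functional as F

def inner_step(w, b, x, y, lr):
    """One differentiable SGD step of the inner loop (training on S)."""
    loss = F.cross_entropy(x @ w + b, y)
    gw, gb = torch.autograd.grad(loss, (w, b), create_graph=True)
    return w - lr * gw, b - lr * gb

# Stand-ins for the original dataset T: 1000 samples, 32 features, 10 classes.
x_real = torch.randn(1000, 32)
y_real = torch.randint(0, 10, (1000,))

# Learnable synthetic set S: 100 samples, 10 per class, with fixed labels.
x_syn = torch.randn(100, 32, requires_grad=True)
y_syn = torch.arange(10).repeat(10)

optimizer = torch.optim.Adam([x_syn], lr=0.01)
for _ in range(500):
    # Fresh randomly initialized model for each outer iteration.
    w = (torch.randn(32, 10) * 0.01).requires_grad_()
    b = torch.zeros(10, requires_grad=True)
    # Inner loop: a few unrolled training steps on the synthetic data.
    for _ in range(5):
        w, b = inner_step(w, b, x_syn, y_syn, lr=0.1)
    # Outer loss: performance on real data of the model trained on S;
    # gradients flow back through the unrolled steps into x_syn.
    outer_loss = F.cross_entropy(x_real @ w + b, y_real)
    optimizer.zero_grad()
    outer_loss.backward()
    optimizer.step()
```

Fully unrolling the inner loop like this is expensive; practical DD methods typically substitute surrogate objectives such as matching gradients, feature distributions, or training trajectories, which the paper's taxonomy organizes.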
Source journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
CiteScore: 28.40
Self-citation rate: 3.00%
Annual publications: 885
Average review time: 8.5 months
About the journal: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition, and relevant specialized hardware and/or software architectures are also covered.