Toward Effective Knowledge Distillation: Navigating Beyond Small-data Pitfall
Zhiwei Hao, Jianyuan Guo, Kai Han, Han Hu, Chang Xu, Yunhe Wang
IEEE Transactions on Pattern Analysis and Machine Intelligence, published 2025-09-09. DOI: 10.1109/tpami.2025.3607982 (https://doi.org/10.1109/tpami.2025.3607982)
Citations: 0
Abstract
The spectacular success of training large models on extensive datasets highlights the potential of scaling up for exceptional performance. To deploy these models on edge devices, knowledge distillation (KD) is commonly used to create a compact model from a larger, pretrained teacher model. However, as models and datasets rapidly scale up in practical applications, it is crucial to consider the applicability of existing KD approaches originally designed for limited-capacity architectures and small-scale datasets. In this paper, we revisit current KD methods and identify the presence of a small-data pitfall, where most modifications to vanilla KD prove ineffective on large-scale datasets. To guide the design of consistently effective KD methods across different data scales, we conduct a meticulous evaluation of the knowledge transfer process. Our findings reveal that incorporating more useful information is crucial for achieving consistently effective KD methods, while modifications in loss functions show relatively less significance. In light of this, we present a paradigmatic example that combines vanilla KD with deep supervision, incorporating additional information into the student during distillation. This approach surpasses almost all recent KD methods. We believe our study will offer valuable insights to guide the community in navigating beyond the small-data pitfall and toward consistently effective KD.
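The abstract describes combining vanilla KD with deep supervision so that intermediate layers of the student also receive teacher guidance. The sketch below is an illustrative reading of that idea, not the authors' released code: the function names, the auxiliary-head logits (`aux_logits_list`), the temperature, and the loss weights are all assumptions made here for clarity.

```python
# Minimal sketch, assuming a student with auxiliary classifier heads attached to
# intermediate features. Vanilla KD (temperature-softened KL + cross-entropy) is
# applied to the final head and, via deep supervision, to each auxiliary head.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Vanilla KD: KL between softened teacher/student distributions plus CE with labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


def deeply_supervised_kd_loss(final_logits, aux_logits_list, teacher_logits,
                              labels, aux_weight=0.3):
    """Apply the same KD objective to the final head and to every auxiliary head."""
    loss = kd_loss(final_logits, teacher_logits, labels)
    for aux_logits in aux_logits_list:  # hypothetical deep-supervision heads
        loss = loss + aux_weight * kd_loss(aux_logits, teacher_logits, labels)
    return loss


if __name__ == "__main__":
    batch, classes = 8, 1000
    teacher_logits = torch.randn(batch, classes)
    final_logits = torch.randn(batch, classes, requires_grad=True)
    aux_logits_list = [torch.randn(batch, classes, requires_grad=True) for _ in range(2)]
    labels = torch.randint(0, classes, (batch,))
    loss = deeply_supervised_kd_loss(final_logits, aux_logits_list, teacher_logits, labels)
    loss.backward()
    print(loss.item())
```

Under this reading, the modification adds information to the student (teacher signal at multiple depths) rather than changing the loss form, which is the property the abstract identifies as decisive at scale.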
Journal overview:
The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.