A dual-level part distillation network for fine-grained visual categorization

IF 2.7 | Region 3, Engineering & Technology | JCR Q2, ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing-Image Communication, Volume 138, Article 117383 | Pub Date: 2025-07-16 | DOI: 10.1016/j.image.2025.117383

Xiangfen Zhang, Shitao Hong, Haixia Luo, Zhen Jiang, Feiniu Yuan


Citations: 0

Abstract


A dual-level part distillation network for fine-grained visual categorization

Fine-Grained Visual Categorization (FGVC) remains a formidable challenge because intra-class variation is large and inter-class variation is small, so categories can be distinguished only by local details. Existing methods adopt part detection modules to localize discriminative regions and extract part-level features, which offer crucial supplementary information for FGVC. However, these methods suffer from high computational complexity stemming from part detection and part-level feature extraction, and they lack connectivity between different parts. To solve these problems, we propose a Dual-level Part Distillation Network (DPD-Net) for FGVC. DPD-Net extracts features at both the object level and the part level. At the object level, we use residual networks to extract middle-level and high-level features, generate a middle and a high object-level prediction from them, and concatenate these two predictions to produce the final output. At the part level, we use a part detection module to localize discriminative parts and extract part-level features, add the features of different parts point-wise to generate an averaged part-level prediction, and concatenate the features of different parts to produce a concatenated part-level prediction. We use knowledge distillation to transfer information from the averaged and concatenated part-level predictions to the middle and high object-level predictions, respectively. To supervise training, we design five losses: a pair-wise consistency loss on the detected parts, a loss on the concatenated final prediction, a loss on the averaged part-level prediction, a cosine-embedding loss, and a loss on the concatenated part-level prediction. Experimental results show that DPD-Net achieves state-of-the-art performance on three fine-grained visual recognition benchmarks. In addition, DPD-Net can be trained end-to-end without extra annotations.
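The pairing of part-level "teachers" with object-level "students" described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the abstract does not give the form of the distillation objective, so a common temperature-softened KL divergence is used here as a stand-in, and every function name, the temperature value, and the example logits are hypothetical rather than taken from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: this generic distillation loss stands in for whatever
    objective the paper actually uses.
    """
    teacher = log_softmax(teacher_logits, temperature)
    student = log_softmax(student_logits, temperature)
    kl = sum(math.exp(t) * (t - s) for t, s in zip(teacher, student))
    return temperature ** 2 * kl


def average_parts(part_features):
    """Add part feature vectors point-wise, then divide by the part count."""
    count = len(part_features)
    return [sum(values) / count for values in zip(*part_features)]


def concat_parts(part_features):
    """Concatenate part feature vectors into one long vector."""
    return [value for feature in part_features for value in feature]


def dual_level_distillation(middle_logits, high_logits,
                            avg_part_logits, cat_part_logits):
    """Averaged parts teach the middle head; concatenated parts teach the high head."""
    return (kd_loss(middle_logits, avg_part_logits)
            + kd_loss(high_logits, cat_part_logits))


# Illustrative values only: three detected parts with 2-D features.
parts = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
averaged = average_parts(parts)      # input to a (hypothetical) averaged-part classifier
concatenated = concat_parts(parts)   # input to a (hypothetical) concatenated-part classifier

# Made-up class logits for the four prediction heads named in the abstract.
loss = dual_level_distillation(
    middle_logits=[0.2, 1.0, -0.5],
    high_logits=[0.1, 1.5, -1.0],
    avg_part_logits=[0.0, 2.0, -1.0],
    cat_part_logits=[0.3, 2.2, -0.8],
)
print(loss)
```

In this sketch the part-level logits act only as targets. In a real training loop one would usually stop gradients from flowing into the teacher branch, but whether DPD-Net does so is not stated in the abstract.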


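Source journal

Signal Processing-Image Communication (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 8.40
Self-citation rate: 2.90%
Articles published: 138
Review time: 5.2 months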

Journal description: Signal Processing: Image Communication is an international journal for the development of the theory and practice of image communication. Its primary objectives are the following: To present a forum for the advancement of theory and practice of image communication. To stimulate cross-fertilization between areas similar in nature which have traditionally been separated, for example, various aspects of visual communications and information systems. To contribute to a rapid information exchange between the industrial and academic environments. The editorial policy and the technical content of the journal are the responsibility of the Editor-in-Chief, the Area Editors and the Advisory Editors. The Journal is self-supporting from subscription income and contains a minimum amount of advertisements. Advertisements are subject to the prior approval of the Editor-in-Chief. The journal welcomes contributions from every country in the world. Signal Processing: Image Communication publishes articles relating to aspects of the design, implementation and use of image communication systems. The journal features original research work, tutorial and review articles, and accounts of practical developments. Subjects of interest include image/video coding, 3D video representations and compression, 3D graphics and animation compression, HDTV and 3DTV systems, video adaptation, video over IP, peer-to-peer video networking, interactive visual communication, multi-user video conferencing, wireless video broadcasting and communication, visual surveillance, 2D and 3D image/video quality measures, pre/post processing, video restoration and super-resolution, multi-camera video analysis, motion analysis, content-based image/video indexing and retrieval, face and gesture processing, video synthesis, 2D and 3D image/video acquisition and display technologies, architectures for image/video processing and communication.