{"title":"SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector.","authors":"Shuailei Ma, Yuefeng Wang, Ying Wei, Enming Zhang, Jiaqi Fan, Xinyu Sun, Peihao Chen","doi":"10.1109/TPAMI.2025.3600435","DOIUrl":null,"url":null,"abstract":"<p><p>Open World Object Detection (OWOD) is a novel computer vision task with a considerable challenge, bridging the gap between classic object detection (OD) and real-world object detection. In addition to detecting and classifying seen/known objects, OWOD algorithms are expected to localize all potential unseen/unknown objects and incrementally learn them. The large pre-trained vision-language grounding models (VLM, e.g., GLIP) have rich knowledge about the open world, but are limited by text prompts and cannot localize indescribable objects. However, there are many detection scenarios in which pre-defined language descriptions are unavailable during inference. In this paper, we attempt to specialize the VLM model for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector. Surprisingly, we observe that the simple knowledge distillation approach leads to unexpected performance for unknown object detection, even with a small amount of data. Unfortunately, knowledge distillation for unknown objects severely affects the learning of detectors with conventional structures, leading to catastrophic damage to the model's ability to learn about known objects. To alleviate these problems, we propose the down-weight training strategy for knowledge distillation from vision-language model to single visual modality one. Meanwhile, we propose the cascade decoupled decoders that decouple the learning of localization and recognition to reduce the impact of category interactions of known and unknown objects on the localization learning process. Ablation experiments demonstrate that both of them are effective in mitigating the impact of open-world knowledge distillation on the learning of known objects. Additionally, to alleviate the current lack of comprehensive benchmarks for evaluating the ability of the open-world detector to detect unknown objects in the open world, we refine the benchmark for evaluating the performance of unknown object detection by augmenting annotations for unknown objects which we name\"IntensiveSet$\\spadesuit$\". Comprehensive experiments performed on OWOD, MS-COCO, and our proposed benchmarks demonstrate the effectiveness of our methods.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6000,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TPAMI.2025.3600435","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Open World Object Detection (OWOD) is a challenging computer vision task that bridges the gap between classic object detection (OD) and real-world object detection. In addition to detecting and classifying seen/known objects, OWOD algorithms are expected to localize all potential unseen/unknown objects and learn them incrementally. Large pre-trained vision-language grounding models (VLMs, e.g., GLIP) have rich knowledge about the open world, but they are limited by text prompts and cannot localize objects that defy description. Moreover, in many detection scenarios, pre-defined language descriptions are unavailable during inference. In this paper, we attempt to specialize the VLM for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector. Surprisingly, we observe that this simple knowledge distillation approach yields unexpectedly strong unknown object detection, even with a small amount of data. Unfortunately, knowledge distillation for unknown objects severely interferes with the training of detectors with conventional structures, catastrophically damaging the model's ability to learn known objects. To alleviate these problems, we propose a down-weight training strategy for distilling knowledge from the vision-language model into a single-visual-modality detector. We also propose cascade decoupled decoders that separate the learning of localization and recognition, reducing the impact of category interactions between known and unknown objects on localization learning. Ablation experiments demonstrate that both components effectively mitigate the impact of open-world knowledge distillation on the learning of known objects. Additionally, to address the current lack of a comprehensive benchmark for evaluating an open-world detector's ability to detect unknown objects, we refine the existing benchmark by augmenting the annotations of unknown objects, and name the result "IntensiveSet$\spadesuit$". Comprehensive experiments on the OWOD and MS-COCO benchmarks and on our proposed benchmark demonstrate the effectiveness of our methods.
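The abstract does not spell out the loss behind the down-weight training strategy. The following is a minimal PyTorch-style sketch of one plausible reading: supervised losses for annotated known objects, plus a distillation term that pulls detector proposals toward pseudo boxes produced by the VLM for potential unknowns, scaled by a small weight so it cannot overwhelm known-object learning. The function name, the matching of proposals to pseudo boxes, and the choice of L1 losses are all assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F


def skdf_down_weighted_loss(known_logits, known_labels,
                            known_pred_boxes, known_gt_boxes,
                            unknown_pred_boxes, vlm_pseudo_boxes,
                            distill_weight=0.1):
    """Hypothetical combined loss: supervised known-object terms plus a
    down-weighted distillation term toward VLM (e.g., GLIP) pseudo boxes."""
    # Standard supervised terms for annotated (known) objects.
    cls_loss = F.cross_entropy(known_logits, known_labels)
    reg_loss = F.l1_loss(known_pred_boxes, known_gt_boxes)
    # Distillation term: pull matched proposals toward VLM pseudo boxes
    # that mark potential unknown objects.
    distill_loss = F.l1_loss(unknown_pred_boxes, vlm_pseudo_boxes)
    # The small distill_weight is the "down-weight": it limits how much
    # open-world distillation disturbs known-object learning (assumed).
    return cls_loss + reg_loss + distill_weight * distill_loss


# Toy usage with random tensors (shapes are illustrative).
logits = torch.randn(8, 81)          # 8 matched known proposals, 81 classes
labels = torch.randint(0, 81, (8,))
loss = skdf_down_weighted_loss(logits, labels,
                               torch.rand(8, 4), torch.rand(8, 4),
                               torch.rand(5, 4), torch.rand(5, 4))
```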
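Similarly, the abstract only names the cascade decoupled decoders; the architecture is not given there. Below is a minimal, hypothetical DETR-style sketch in PyTorch of the decoupling idea: a class-agnostic localization decoder runs first, and a recognition decoder consumes its detached outputs, so gradients from known/unknown classification cannot perturb localization. The module name, layer counts, dimensions, and the use of `detach()` are assumptions made for this sketch.

```python
import torch
import torch.nn as nn


class CascadeDecoupledDecoders(nn.Module):
    """Illustrative cascaded head that learns localization and recognition
    in separate decoder stages (an assumed reading of the paper's design)."""

    def __init__(self, dim=256, heads=8, num_classes=81):
        super().__init__()
        loc_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        cls_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.loc_decoder = nn.TransformerDecoder(loc_layer, num_layers=3)
        self.cls_decoder = nn.TransformerDecoder(cls_layer, num_layers=3)
        self.box_head = nn.Linear(dim, 4)            # (cx, cy, w, h)
        self.cls_head = nn.Linear(dim, num_classes)  # known classes + unknown

    def forward(self, queries, memory):
        # Stage 1: class-agnostic localization from object queries.
        loc_feat = self.loc_decoder(queries, memory)
        boxes = self.box_head(loc_feat).sigmoid()
        # Stage 2: recognition on detached localization features, so
        # category gradients (known vs. unknown) do not flow back into
        # the localization branch.
        cls_feat = self.cls_decoder(loc_feat.detach(), memory)
        logits = self.cls_head(cls_feat)
        return boxes, logits


# Toy usage: 2 images, 100 object queries, 400 encoder tokens of width 256.
head = CascadeDecoupledDecoders()
boxes, logits = head(torch.randn(2, 100, 256), torch.randn(2, 400, 256))
```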