DDMCB: Open-world object detection empowered by Denoising Diffusion Models and Calibration Balance

IF 4.2; Zone 3, Computer Science; Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2025-03-19 DOI:10.1016/j.imavis.2025.105508

Yangyang Huang, Xing Xi, Ronghua Luo

Image and Vision Computing, Volume 157, Article 105508. Journal Article. https://www.sciencedirect.com/science/article/pii/S0262885625000964

Citations: 0

Abstract

Open-world object detection (OWOD) differs from traditional object detection in that it is better suited to real-world, dynamic scenarios. It aims to recognize unseen objects and to learn incrementally from newly introduced knowledge. However, current OWOD methods usually rely on the supervision of known objects to identify unknown objects, using high objectness scores as the key indicator of potential unknown objects. While these methods can detect unknown objects whose features resemble those of known objects, they also classify regions dissimilar to known objects as background, which leads to label bias. To address this problem, we leverage knowledge from large visual models to provide auxiliary supervision for unknown objects. Additionally, we apply the Denoising Diffusion Probabilistic Model (DDPM) to OWOD scenarios and propose an unsupervised modeling approach based on DDPM, which significantly improves the accuracy of unknown object detection. Even so, the classifier encounters only known classes during training, so it assigns higher confidence to known classes during inference, and bias arises again. We therefore propose a probability calibration technique that post-processes predictions during inference. The calibration reduces the probabilities of known objects and increases the probabilities of unknown objects, thereby balancing the final probability predictions. Our experiments demonstrate that the proposed method achieves significant improvements on OWOD benchmarks, with an unknown-object recall of 54.7 U-Recall, surpassing the current state-of-the-art (SOTA) methods by 44.3%. In terms of real-time performance, our model uses few parameters and pure convolutional neural networks instead of intensive attention mechanisms, achieving an inference speed of 35.04 FPS, exceeding the SOTA OWOD methods based on Faster R-CNN and Deformable DETR by 2.79 and 10.95 FPS, respectively.
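
The abstract does not give the calibration formula, so the following is only a minimal sketch of what a post-hoc probability calibration of this kind could look like. It assumes each detection carries a probability vector whose last entry is the unknown class, and it uses two hypothetical scaling factors (`known_scale`, `unknown_scale`) that are not taken from the paper. It illustrates the general idea of lowering known-class probabilities and raising the unknown-class probability before renormalizing; it should not be read as the authors' method.

```python
from typing import List


def calibrate(probs: List[float],
              known_scale: float = 0.8,
              unknown_scale: float = 1.5) -> List[float]:
    """Rebalance one detection's class probabilities at inference time.

    Assumption (not from the paper): `probs` holds the known-class
    probabilities followed by a single unknown-class probability.
    `known_scale` < 1 shrinks known classes, `unknown_scale` > 1 boosts
    the unknown class; both values are hypothetical.
    """
    if not probs:
        raise ValueError("probs must contain at least the unknown-class entry")

    *known, unknown = probs
    scaled = [p * known_scale for p in known] + [unknown * unknown_scale]

    total = sum(scaled)
    if total == 0:
        # Nothing to rebalance; return an unchanged copy.
        return list(probs)

    # Renormalize so the calibrated scores again sum to 1.
    return [p / total for p in scaled]


if __name__ == "__main__":
    raw = [0.6, 0.1, 0.3]  # two known classes, then the unknown class
    print(calibrate(raw))
```

With these illustrative factors, a detection that the classifier scores as mostly "known" moves part of its probability mass to the unknown class, which is the balancing effect the abstract describes. In practice the factors would have to be chosen on validation data.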


Source journal

Image and Vision Computing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

8.50

Self-citation rate

8.50%

Articles published

143

Review time

7.8 months

About the journal: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.