DGC-Net: Dynamic Graph Contrastive Network for Video Object Detection

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society. Pub Date: 2025-03-19. DOI: 10.1109/TIP.2025.3551158

Qiang Qi; Hanzi Wang; Yan Yan; Xuelong Li

Volume 34, pages 2269-2284. Publication type: Journal Article. Full text: https://ieeexplore.ieee.org/document/10934730/

Abstract

Video object detection is a challenging task in computer vision because it must handle object appearance degradation, a problem that seldom occurs in still images. Existing video object detection methods typically aggregate multi-frame features in one step to alleviate appearance degradation. However, these methods do not take supervision knowledge into consideration and thus still suffer from insufficient feature aggregation, which leads to false detections. In this paper, we take a different perspective on feature aggregation and propose a dynamic graph contrastive network (DGC-Net) for video object detection, with three improvements over existing methods. First, we design a frame-level graph contrastive module to aggregate frame features, enabling DGC-Net to fully exploit discriminative contextual feature representations. Second, we develop a proposal-level graph contrastive module to aggregate proposal features, so that DGC-Net learns discriminative semantic feature representations. Third, we present a graph transformer that dynamically adjusts the graph structure by pruning useless nodes and edges, which improves both accuracy and efficiency by eliminating geometric-semantic ambiguity and reducing the graph scale. Furthermore, building on the DGC-Net framework, we develop DGC-Net Lite to perform real-time video object detection with a much faster inference speed. Extensive experiments on the ImageNet VID dataset demonstrate that DGC-Net outperforms current state-of-the-art methods. Notably, DGC-Net obtains 86.3%/87.3% mAP with ResNet-101/ResNeXt-101 backbones.
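The abstract gives no formulas, so the following is only a generic illustration of the two ideas it names, not the authors' implementation. It sketches (1) a supervised contrastive loss over graph node features, in the style of the widely used SupCon loss, and (2) pruning of nodes and edges before feature aggregation. All function names, the scoring rule for nodes, and the parameters `temperature`, `keep_ratio`, and `top_k` are assumptions made for this example.

```python
import numpy as np


def cosine_similarity(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of x, shape (N, N)."""
    unit = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    return unit @ unit.T


def supervised_contrastive_loss(x, labels, temperature=0.1):
    """Generic SupCon-style loss: nodes sharing a label are positives.

    Hypothetical stand-in for a "graph contrastive" objective; needs N >= 2.
    """
    n = x.shape[0]
    logits = cosine_similarity(x) / temperature
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self
    # Log-softmax over all other nodes (self is excluded from the denominator).
    masked = np.where(not_self, logits, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0
    # Average negative log-probability of the positives, per anchor.
    per_anchor = -(log_prob * positives).sum(axis=1)[has_pos] / pos_count[has_pos]
    return float(per_anchor.mean())


def prune_and_aggregate(x, keep_ratio=0.5, top_k=2):
    """Drop weakly connected nodes, keep top-k edges per node, then aggregate.

    The node score (strongest link to any other node) is an assumed heuristic.
    """
    n = x.shape[0]
    if n < 2:
        raise ValueError("need at least two nodes")
    sim = cosine_similarity(x)
    np.fill_diagonal(sim, -np.inf)  # a node is never its own neighbour
    # Node pruning: keep the nodes with the strongest link to another node.
    node_score = sim.max(axis=1)
    n_keep = min(n, max(2, int(np.ceil(keep_ratio * n))))
    kept = np.sort(np.argsort(-node_score)[:n_keep])
    sub = sim[np.ix_(kept, kept)]
    # Edge pruning: keep only the k most similar neighbours in each row.
    k = max(1, min(top_k, len(kept) - 1))
    drop = np.argsort(-sub, axis=1)[:, k:]
    np.put_along_axis(sub, drop, -np.inf, axis=1)
    # Softmax over the surviving edges gives the aggregation weights.
    weights = np.exp(sub - sub.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Residual aggregation: each kept node adds a weighted sum of neighbours.
    return kept, x[kept] + weights @ x[kept]
```

In this sketch, pruning shrinks the similarity graph before the softmax, so an isolated node (for example, a background proposal unlike every other proposal) neither receives nor contributes features. The contrastive loss is where label supervision enters: it is small when same-class nodes are already close and large when they are not.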
