IMC-Det: Intra–Inter Modality Contrastive Learning for Video Object Detection

IF 11.6 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision Pub Date : 2024-08-23 DOI:10.1007/s11263-024-02201-9

Qiang Qi, Zhenyu Qiu, Yan Yan, Yang Lu, Hanzi Wang

Citations: 0

Abstract

Video object detection is an important yet challenging task in computer vision. One limitation of off-the-shelf video object detection methods is that they exploit only the visual modality and ignore the semantic knowledge of the textual modality because of the large inter-modality discrepancies, which limits detection performance. In this paper, we propose a novel intra–inter modality contrastive learning network for high-performance video object detection (IMC-Det), which includes three substantial improvements over existing methods. First, we design an intra-modality contrastive learning module that pulls similar features together while pushing dissimilar ones apart, enabling IMC-Det to learn more discriminative feature representations. Second, we develop a graph relational feature aggregation module that models the structural relations between features by leveraging cross-graph learning and residual graph convolution, which supports more effective feature aggregation in the spatio-temporal domain. Third, we present an inter-modality contrastive learning module that forces visual features belonging to the same class to gather compactly around the corresponding textual semantic representations, giving IMC-Det better object classification capability. We conduct extensive experiments on the challenging ImageNet VID dataset, and the results demonstrate that IMC-Det performs favorably against existing state-of-the-art methods. Notably, IMC-Det achieves 85.5% mAP and 86.7% mAP with ResNet-101 and ResNeXt-101, respectively.
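The abstract does not give the loss formulas, so the following is only a minimal sketch of what the two contrastive objectives might look like, assuming a supervised InfoNCE-style form. The function names, the `temperature` value, and the toy two-dimensional features and text embeddings are all hypothetical and are not taken from the paper; the graph relational aggregation module is not covered.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (a zero vector is left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(logits):
    # Numerically stable log-softmax over a list of scores.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def intra_modality_loss(features, labels, temperature=0.1):
    """Assumed form: pull same-class visual features together and push
    different-class features apart (supervised contrastive style)."""
    feats = [l2_normalize(f) for f in features]
    losses = []
    for i, anchor in enumerate(feats):
        others = [j for j in range(len(feats)) if j != i]
        positives = [k for k, j in enumerate(others) if labels[j] == labels[i]]
        if not positives:
            continue  # anchor has no same-class partner in this batch
        log_probs = log_softmax([dot(anchor, feats[j]) / temperature for j in others])
        # Average log-probability assigned to the same-class features.
        losses.append(-sum(log_probs[k] for k in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


def inter_modality_loss(features, labels, text_embeddings, temperature=0.1):
    """Assumed form: each visual feature should be closest to the text
    embedding of its own class (cross-entropy over cosine similarities)."""
    texts = [l2_normalize(t) for t in text_embeddings]
    total = 0.0
    for f, y in zip(features, labels):
        v = l2_normalize(f)
        total -= log_softmax([dot(v, t) / temperature for t in texts])[y]
    return total / len(features)


# Toy data: two visual features near each of two class directions.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
texts = [[1.0, 0.0], [0.0, 1.0]]  # one hypothetical text embedding per class
print(intra_modality_loss(feats, [0, 0, 1, 1]))
print(inter_modality_loss(feats, [0, 0, 1, 1], texts))
```

In this sketch both losses are small when the labels agree with the geometric clusters and grow when the labels are shuffled, which is the behavior the abstract describes qualitatively for the two modules.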

Source journal

International Journal of Computer Vision · Engineering & Technology - Computer Science: Artificial Intelligence

CiteScore: 29.80

Self-citation rate: 2.10%

Articles published: 163

Review time: 6 months

About the journal: The International Journal of Computer Vision (IJCV) publishes high-quality, original contributions to the science and engineering of computer vision, in 12 issues per year. It accepts several article types. Regular articles, up to 25 journal pages, report significant technical advances of broad interest to the field. Short articles, limited to 10 pages, offer a faster publication path for novel results. Survey articles, up to 30 pages, critically evaluate the current state of the art or give tutorial presentations of relevant topics. The journal also publishes book reviews, position papers, and editorials by prominent scientific figures. Authors are encouraged to provide supplementary material online, such as images, video sequences, data sets, and software, to aid understanding and reproducibility of the published research.