Localized Vision-Language Matching for Open-vocabulary Object Detection

German Conference on Pattern Recognition
Pub Date: 2022-05-12
DOI: 10.48550/arXiv.2205.06160

M. A. Bravo, Sudhanshu Mittal, T. Brox

Citations: 13

Abstract

In this work, we propose an open-vocabulary object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes. It is a two-stage training approach that first uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weakly-supervised manner and second specializes the model for the object detection task using known class annotations. We show that a simple language model fits better than a large contextualized language model for detecting novel objects. Moreover, we introduce a consistency-regularization technique to better exploit image-caption pair information. Our method compares favorably to existing open-vocabulary detection approaches while being data-efficient. Source code is available at https://github.com/lmb-freiburg/locov.
