{"title":"You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding","authors":"Qing Du, Yucheng Luo","doi":"10.1109/ICDCSW56584.2022.00035","DOIUrl":null,"url":null,"abstract":"Visual Grounding (VG) aims to locate the most relevant region in an image, based on a flexible natural language query but not a pre-defined label, thus it can be a useful technique in practice. Most methods in VG operate in a two-stage manner, wherein the first stage an object detector is adopted to generate a set of object proposals from the input image and the second stage is simply formulated as a cross-modal matching problem. There might be hundreds of proposals produced in the first stage that need to be compared in the second stage, which is infeasible for real-time VG applications, and the performance of the second stage may be affected by the first stage. In this paper, we propose a much more elegant one-stage detection based method that joints the region proposal and matching stage as a single detection network. The detection is conditioned on the input query with a stack of novel Relation-to-Attention modules that transform the image-to-query relationship to a relation map, which is used to predict the bounding box directly without proposing large numbers of useless region proposals. During the inference, our approach is about 20 x ~ 30 x faster than previous methods and, remarkably, it achieves comparable performance on several benchmark datasets.","PeriodicalId":357138,"journal":{"name":"2022 IEEE 42nd International Conference on Distributed Computing Systems Workshops (ICDCSW)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 42nd International Conference on Distributed Computing Systems Workshops (ICDCSW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCSW56584.2022.00035","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Visual Grounding (VG) aims to locate the most relevant region in an image based on a flexible natural language query rather than a pre-defined label, which makes it a useful technique in practice. Most VG methods operate in a two-stage manner: in the first stage, an object detector generates a set of object proposals from the input image, and the second stage is formulated as a cross-modal matching problem. Hundreds of proposals may be produced in the first stage and compared in the second, which is infeasible for real-time VG applications, and the performance of the second stage can also be limited by the quality of the first. In this paper, we propose a much more elegant one-stage detection-based method that unifies the region proposal and matching stages into a single detection network. The detection is conditioned on the input query through a stack of novel Relation-to-Attention modules that transform the image-to-query relationship into a relation map, which is used to predict the bounding box directly without generating large numbers of useless region proposals. During inference, our approach is about 20x to 30x faster than previous methods and, remarkably, achieves comparable performance on several benchmark datasets.
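The sketch below illustrates the one-stage idea described in the abstract: condition visual features on the query, build a relation map, and regress a single bounding box directly, with no proposal generation or cross-modal matching stage. It is a minimal PyTorch sketch under stated assumptions; the module names (`RelationToAttentionBlock`, `OneStageGrounder`), dimensions, and the exact fusion are illustrative placeholders, not the paper's actual implementation.

```python
# Minimal sketch of a proposal-free, query-conditioned grounding network.
# All design details here are assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationToAttentionBlock(nn.Module):
    """Hypothetical stand-in for one Relation-to-Attention module: project the
    query, correlate it with every spatial location to get a relation map, and
    use that map to re-weight (attend over) the visual features."""

    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(txt_dim, vis_dim)
        self.refine = nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1)

    def forward(self, vis_feat: torch.Tensor, query_emb: torch.Tensor):
        # vis_feat: (B, C, H, W); query_emb: (B, D)
        q = self.query_proj(query_emb)                      # (B, C)
        # Relation map: similarity between the query and each spatial location.
        relation = torch.einsum("bchw,bc->bhw", vis_feat, q)
        relation = torch.sigmoid(relation).unsqueeze(1)     # (B, 1, H, W)
        # Suppress locations unrelated to the query, then refine.
        attended = self.refine(vis_feat * relation)
        return F.relu(attended), relation


class OneStageGrounder(nn.Module):
    """Stacks the blocks above and predicts one normalized box (cx, cy, w, h)
    directly from the attended features, with no region proposals."""

    def __init__(self, vis_dim: int = 256, txt_dim: int = 300, num_blocks: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            RelationToAttentionBlock(vis_dim, txt_dim) for _ in range(num_blocks)
        )
        self.box_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(vis_dim, 4), nn.Sigmoid()
        )

    def forward(self, vis_feat: torch.Tensor, query_emb: torch.Tensor):
        relation = None
        for block in self.blocks:
            vis_feat, relation = block(vis_feat, query_emb)
        return self.box_head(vis_feat), relation


if __name__ == "__main__":
    model = OneStageGrounder()
    image_features = torch.randn(2, 256, 20, 20)   # e.g. from a CNN backbone
    query_embedding = torch.randn(2, 300)          # e.g. pooled word embeddings
    box, relation_map = model(image_features, query_embedding)
    print(box.shape, relation_map.shape)           # (2, 4), (2, 1, 20, 20)
```

Because the box is regressed in a single forward pass over the fused features, inference cost is independent of the number of candidate objects in the scene, which is the source of the speedup the abstract claims over proposal-then-match pipelines.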