You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding

Qing Du, Yucheng Luo
{"title":"你只看和听一次:快速和准确的视觉基础","authors":"Qing Du, Yucheng Luo","doi":"10.1109/ICDCSW56584.2022.00035","DOIUrl":null,"url":null,"abstract":"Visual Grounding (VG) aims to locate the most relevant region in an image, based on a flexible natural language query but not a pre-defined label, thus it can be a useful technique in practice. Most methods in VG operate in a two-stage manner, wherein the first stage an object detector is adopted to generate a set of object proposals from the input image and the second stage is simply formulated as a cross-modal matching problem. There might be hundreds of proposals produced in the first stage that need to be compared in the second stage, which is infeasible for real-time VG applications, and the performance of the second stage may be affected by the first stage. In this paper, we propose a much more elegant one-stage detection based method that joints the region proposal and matching stage as a single detection network. The detection is conditioned on the input query with a stack of novel Relation-to-Attention modules that transform the image-to-query relationship to a relation map, which is used to predict the bounding box directly without proposing large numbers of useless region proposals. During the inference, our approach is about 20 x ~ 30 x faster than previous methods and, remarkably, it achieves comparable performance on several benchmark datasets.","PeriodicalId":357138,"journal":{"name":"2022 IEEE 42nd International Conference on Distributed Computing Systems Workshops (ICDCSW)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding\",\"authors\":\"Qing Du, Yucheng Luo\",\"doi\":\"10.1109/ICDCSW56584.2022.00035\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visual Grounding (VG) aims to locate the most relevant region in an image, based on a flexible natural language query but not a pre-defined label, thus it can be a useful technique in practice. Most methods in VG operate in a two-stage manner, wherein the first stage an object detector is adopted to generate a set of object proposals from the input image and the second stage is simply formulated as a cross-modal matching problem. There might be hundreds of proposals produced in the first stage that need to be compared in the second stage, which is infeasible for real-time VG applications, and the performance of the second stage may be affected by the first stage. In this paper, we propose a much more elegant one-stage detection based method that joints the region proposal and matching stage as a single detection network. The detection is conditioned on the input query with a stack of novel Relation-to-Attention modules that transform the image-to-query relationship to a relation map, which is used to predict the bounding box directly without proposing large numbers of useless region proposals. 
During the inference, our approach is about 20 x ~ 30 x faster than previous methods and, remarkably, it achieves comparable performance on several benchmark datasets.\",\"PeriodicalId\":357138,\"journal\":{\"name\":\"2022 IEEE 42nd International Conference on Distributed Computing Systems Workshops (ICDCSW)\",\"volume\":\"18 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE 42nd International Conference on Distributed Computing Systems Workshops (ICDCSW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDCSW56584.2022.00035\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 42nd International Conference on Distributed Computing Systems Workshops (ICDCSW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCSW56584.2022.00035","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Visual Grounding (VG) aims to locate the most relevant region in an image based on a flexible natural language query rather than a pre-defined label, which makes it a useful technique in practice. Most VG methods operate in a two-stage manner: in the first stage, an object detector generates a set of object proposals from the input image, and the second stage is then formulated as a cross-modal matching problem. The first stage may produce hundreds of proposals that all need to be compared in the second stage, which is infeasible for real-time VG applications, and the performance of the second stage may also be limited by the first. In this paper, we propose a more elegant one-stage detection-based method that merges the region proposal and matching stages into a single detection network. Detection is conditioned on the input query through a stack of novel Relation-to-Attention modules, which transform the image-to-query relationship into a relation map used to predict the bounding box directly, without generating large numbers of useless region proposals. At inference time, our approach is about 20x to 30x faster than previous methods and, remarkably, achieves comparable performance on several benchmark datasets.
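The abstract only outlines the architecture, so the following is a minimal PyTorch sketch of how a query-conditioned relation map could drive direct box prediction in a one-stage grounding network. All module names, dimensions, and the single dot-product attention design here are illustrative assumptions, not the paper's actual Relation-to-Attention implementation.

```python
# Hypothetical sketch of a query-conditioned one-stage grounding head.
# Names, dimensions, and the dot-product relation below are assumptions
# for illustration; they are not the paper's published architecture.
import torch
import torch.nn as nn

class RelationToAttentionSketch(nn.Module):
    """Turns the image-to-query relationship into a spatial relation map."""
    def __init__(self, img_dim=256, txt_dim=256):
        super().__init__()
        self.img_proj = nn.Conv2d(img_dim, txt_dim, kernel_size=1)
        self.txt_proj = nn.Linear(txt_dim, txt_dim)

    def forward(self, img_feats, query_emb):
        # img_feats: (B, C, H, W) visual features from a backbone
        # query_emb: (B, D) pooled language-query embedding
        v = self.img_proj(img_feats)              # (B, D, H, W)
        q = self.txt_proj(query_emb)              # (B, D)
        # Dot-product relation between every spatial cell and the query.
        rel = torch.einsum("bdhw,bd->bhw", v, q) / q.shape[-1] ** 0.5
        attn = torch.sigmoid(rel).unsqueeze(1)    # (B, 1, H, W) relation map
        # Modulate the visual features by the relation map.
        return v * attn, attn

class OneStageGroundingHead(nn.Module):
    """Predicts a single box directly, with no region-proposal stage."""
    def __init__(self, dim=256):
        super().__init__()
        self.r2a = RelationToAttentionSketch(dim, dim)
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h), normalized

    def forward(self, img_feats, query_emb):
        fused, attn = self.r2a(img_feats, query_emb)
        # Attention-weighted global pooling replaces proposal ranking.
        pooled = (fused * attn).flatten(2).sum(-1) / attn.flatten(2).sum(-1)
        return torch.sigmoid(self.box_head(pooled))  # one box per query

# Usage: one forward pass yields one box; cost is independent of any
# proposal count, which is the source of the one-stage speedup.
head = OneStageGroundingHead(dim=256)
boxes = head(torch.randn(2, 256, 20, 20), torch.randn(2, 256))
print(boxes.shape)  # torch.Size([2, 4])
```

The point of the sketch is the control flow rather than the exact layers: the language query reweights the visual feature map once, and a single regression head emits one box, so inference does not scale with the number of proposals as it does in two-stage pipelines.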