Language conditioned multi-scale visual attention networks for visual grounding

IF 4.2 | Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing | Pub Date: 2024-08-25 | DOI: 10.1016/j.imavis.2024.105242

Haibo Yao, Lipeng Wang, Chengtao Cai, Wei Wang, Zhi Zhang, Xiaobing Shang

Image and Vision Computing, Volume 150, Article 105242. Full text: https://www.sciencedirect.com/science/article/pii/S0262885624003470

Citations: 0

Abstract

Visual grounding (VG) is the task of locating a specific region in an image according to a natural language expression. Existing approaches to VG fall into two-stage, one-stage and Transformer-based methods, all of which have achieved high performance. However, most previous methods extract visual information at a single spatial scale and ignore the other spatial scales, so these models cannot fully utilize the visual information. Moreover, insufficient use of linguistic information, especially the failure to capture global linguistic information, may prevent a model from fully understanding the language expression, which limits performance. To better address the task, we propose a language conditioned multi-scale visual attention network (LMSVA) for visual grounding, which makes fuller use of visual and linguistic information to perform multimodal reasoning, thus improving model performance. Specifically, we design a visual feature extractor containing a multi-scale layer that obtains the required multi-scale visual features by expanding the original backbone. In addition, we pool the output of the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to extract sentence-level linguistic features, which enables the model to capture global linguistic information. Inspired by the Transformer architecture, we present the Visual Attention Layer guided by Language and Multi-Scale Visual Features (VALMS), which better learns the visual context under the guidance of multi-scale visual and linguistic features and facilitates further multimodal reasoning. Extensive experiments on four large benchmark datasets (ReferItGame, RefCOCO, RefCOCO+ and RefCOCOg) demonstrate that our proposed model achieves state-of-the-art performance.
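The two ideas named in the abstract, pooling token features into one sentence-level vector and letting that vector guide attention over visual features from several scales, can be illustrated with a small toy sketch. This is not the authors' LMSVA or VALMS implementation: the abstract does not specify the pooling type or the attention layout, so mean pooling, plain scaled dot-product attention without learned projections, and all function names (`mean_pool`, `language_guided_attention`) are assumptions made purely for illustration.

```python
import math


def mean_pool(token_vecs, mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped).

    Assumption: mean pooling stands in for the unspecified pooling of BERT outputs.
    """
    kept = [v for v, m in zip(token_vecs, mask) if m]
    dim = len(token_vecs[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def language_guided_attention(sentence_vec, scales):
    """Attend over visual tokens from all scales, using the sentence vector as query.

    scales: list of feature maps, each already flattened to a list of token vectors.
    Returns (weights, context), where context is the attention-weighted token sum.
    """
    tokens = [tok for scale in scales for tok in scale]  # concatenate all scales
    dim = len(sentence_vec)
    scores = [
        sum(q * k for q, k in zip(sentence_vec, tok)) / math.sqrt(dim)
        for tok in tokens
    ]
    weights = softmax(scores)
    context = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
    return weights, context


# Toy inputs: three word vectors (the last one is padding) and two visual scales.
words = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence = mean_pool(words, mask)

coarse = [[2.0, 2.0]]                  # one token from a coarse-scale feature map
fine = [[0.0, 0.0], [-2.0, -2.0]]      # two tokens from a finer-scale feature map
weights, context = language_guided_attention(sentence, [coarse, fine])
```

In this toy setup the visual token most aligned with the pooled sentence vector receives the largest weight, regardless of which scale it came from. A real model would add learned query, key and value projections, multiple heads, and positional encodings, none of which are shown here.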


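Source journal: Image and Vision Computing (Engineering & Technology; Engineering, Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months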

Journal description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.