Visual Question Answering Based on Position Alignment

Qihao Xia, Chao Yu, Pingping Peng, Henghao Gu, Zhengqi Zheng, Kun Zhao
{"title":"Visual Question Answering Based on Position Alignment","authors":"Qihao Xia, Chao Yu, Pingping Peng, Henghao Gu, Zhengqi Zheng, Kun Zhao","doi":"10.1109/CISP-BMEI53629.2021.9624447","DOIUrl":null,"url":null,"abstract":"The alignment of information from images and questions is of great significance in the visual question answering task. Whether an object in image is related to the question or not is the basic judgement relied on the feature alignment. Many previous works have proposed different alignment methods to build better cross modality interaction. The attention mechanism is the most used method in making alignment. The classical bottom up and top down model builds a top down attention distribution by concatenating question features to each image features and calculates the attention weights between question and image. However, the bottom up and top down model didn't consider the positional information in image and question. In this paper, we revisit the attention distribution from a position perspective which aligns question to object's positional information. We first embed the positional information of each object in image and calculate a position attention distribution to indicate the relevance between objects' positions in context of the current question. Through the attention distribution model can select the related position in image to answer the given question. The position attention distribution is concatenated to the feature attention distribution to get the final distribution. We evaluate our method on visual question answering (VQA2.0) dataset, and show that our method is effective in multimodal alignment.","PeriodicalId":131256,"journal":{"name":"2021 14th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)","volume":"1 8","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 14th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CISP-BMEI53629.2021.9624447","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The alignment of information from images and questions is of great significance in the visual question answering (VQA) task. Whether an object in the image is related to the question is a basic judgement that relies on feature alignment. Many previous works have proposed different alignment methods to build better cross-modality interaction, and the attention mechanism is the most widely used among them. The classical bottom-up and top-down model builds a top-down attention distribution by concatenating the question features to each image feature and computing attention weights between the question and the image. However, the bottom-up and top-down model does not consider the positional information in the image and the question. In this paper, we revisit the attention distribution from a position perspective, aligning the question to the objects' positional information. We first embed the positional information of each object in the image and compute a position attention distribution that indicates the relevance of the objects' positions in the context of the current question. Through this attention distribution, the model can select the relevant positions in the image to answer the given question. The position attention distribution is concatenated with the feature attention distribution to obtain the final distribution. We evaluate our method on the visual question answering (VQA 2.0) dataset and show that it is effective for multimodal alignment.
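
The mechanism the abstract describes can be sketched in a few lines of PyTorch. The snippet below is a minimal, hypothetical illustration of the idea, not the authors' implementation: the module name PositionAlignedAttention, the feature and hidden dimensions, and the fusion layer are all assumptions. It tiles the question vector across the K detected objects, scores each object once from its visual features (the classical top-down branch) and once from its embedded bounding box (the position branch), and fuses the two scores into a single attention distribution.

```python
# Minimal sketch of position-aligned attention for VQA, assuming
# Faster R-CNN-style object features and (x, y, w, h) box coordinates.
# All layer names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAlignedAttention(nn.Module):
    def __init__(self, v_dim=2048, q_dim=1024, pos_dim=4, hid=512):
        super().__init__()
        # Feature branch: top-down attention over object features.
        self.feat_proj = nn.Linear(v_dim + q_dim, hid)
        self.feat_score = nn.Linear(hid, 1)
        # Position branch: embed each object's box and score it
        # against the question the same way.
        self.pos_embed = nn.Linear(pos_dim, hid)
        self.pos_proj = nn.Linear(hid + q_dim, hid)
        self.pos_score = nn.Linear(hid, 1)
        # Fuse the concatenated per-object scores into one logit.
        self.fuse = nn.Linear(2, 1)

    def forward(self, v, boxes, q):
        # v: (B, K, v_dim) object features; boxes: (B, K, 4); q: (B, q_dim)
        K = v.size(1)
        q_rep = q.unsqueeze(1).expand(-1, K, -1)  # tile question per object
        feat_logit = self.feat_score(torch.relu(
            self.feat_proj(torch.cat([v, q_rep], dim=-1))))      # (B, K, 1)
        p = self.pos_embed(boxes)                                # (B, K, hid)
        pos_logit = self.pos_score(torch.relu(
            self.pos_proj(torch.cat([p, q_rep], dim=-1))))       # (B, K, 1)
        # Concatenate the two scores and fuse into a final distribution.
        logits = self.fuse(torch.cat([feat_logit, pos_logit], dim=-1))
        alpha = F.softmax(logits, dim=1)          # attention over K objects
        return (alpha * v).sum(dim=1)             # attended image feature
```

Under these assumptions, the position branch lets a question such as "What is on the left of the table?" up-weight objects whose box coordinates match the queried region, even when their visual features alone would not be discriminative.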