{"title":"Enhancing scene-text visual question answering with relational reasoning, attention and dynamic vocabulary integration","authors":"Mayank Agrawal, Anand Singh Jalal, Himanshu Sharma","doi":"10.1111/coin.12635","DOIUrl":null,"url":null,"abstract":"<p>Visual question answering (VQA) is a challenging task in computer vision. Recently, there has been a growing interest in text-based VQA tasks, emphasizing the important role of textual information for better understanding of images. Effectively utilizing text information within the image is crucial for achieving success in this task. However, existing approaches often overlook the contextual information and neglect to utilize the relationships between scene-text tokens and image objects. They simply incorporate the scene-text tokens mined from the image into the VQA model without considering these important factors. In this paper, the proposed model initially analyzes the image to extract text and identify scene objects. It then comprehends the question and mines relationships among the question, OCRed text, and scene objects, ultimately generating an answer through relational reasoning by conducting semantic and positional attention. Our decoder with attention map loss enables prediction of complex answers and handles dynamic vocabularies, reducing decoding space. It outperforms softmax-based cross entropy loss in accuracy and efficiency by accommodating varying vocabulary sizes. We evaluated our model's performance on the TextVQA dataset and achieved an accuracy of 53.91% on the validation set and 53.98% on the test set. Moreover, on the ST-VQA dataset, our model obtained ANLS scores of 0.699 on the validation set and 0.692 on the test set.</p>","PeriodicalId":55228,"journal":{"name":"Computational Intelligence","volume":null,"pages":null},"PeriodicalIF":1.8000,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/coin.12635","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Visual question answering (VQA) is a challenging task in computer vision. Recently, there has been growing interest in text-based VQA tasks, reflecting the important role of textual information in image understanding. Effectively utilizing the text within an image is crucial for success in this task. However, existing approaches often overlook contextual information and neglect the relationships between scene-text tokens and image objects: they simply feed the scene-text tokens mined from the image into the VQA model without considering these factors. In this paper, the proposed model first analyzes the image to extract text and identify scene objects. It then comprehends the question and mines relationships among the question, the OCRed text, and the scene objects, ultimately generating an answer through relational reasoning with semantic and positional attention. Our decoder, trained with an attention-map loss, can predict complex answers and handle dynamic vocabularies, reducing the decoding space. It outperforms softmax-based cross-entropy loss in both accuracy and efficiency by accommodating varying vocabulary sizes. We evaluated the model on the TextVQA dataset, achieving an accuracy of 53.91% on the validation set and 53.98% on the test set. On the ST-VQA dataset, our model obtained ANLS scores of 0.699 on the validation set and 0.692 on the test set.
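To make the dynamic-vocabulary idea concrete, below is a minimal sketch of a decoder scoring step that combines a fixed answer vocabulary with the OCR tokens found in a given image, so the set of candidate answer tokens grows and shrinks per image rather than requiring one large softmax. This is an illustrative reconstruction under assumptions, not the authors' code: the class name, the dot-product scoring, and the tensor shapes are all hypothetical.

```python
# A hedged sketch of dynamic-vocabulary answer scoring for scene-text VQA.
# Assumptions (not from the paper): dot-product scoring, a single shared
# hidden size, and the names DynamicVocabDecoder / query_proj.
import torch
import torch.nn as nn

class DynamicVocabDecoder(nn.Module):
    def __init__(self, hidden_dim: int, fixed_vocab_size: int):
        super().__init__()
        # Learned embeddings for the fixed (frequent-answer) vocabulary.
        self.fixed_vocab = nn.Embedding(fixed_vocab_size, hidden_dim)
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, dec_state: torch.Tensor, ocr_feats: torch.Tensor) -> torch.Tensor:
        """
        dec_state: (batch, hidden_dim)         decoder state at the current step
        ocr_feats: (batch, n_ocr, hidden_dim)  features of OCR tokens in the image
        returns:   (batch, fixed_vocab_size + n_ocr) scores over the dynamic vocabulary
        """
        q = self.query_proj(dec_state)                                 # (B, H)
        fixed_scores = q @ self.fixed_vocab.weight.T                   # (B, V)
        ocr_scores = torch.bmm(ocr_feats, q.unsqueeze(2)).squeeze(2)   # (B, n_ocr)
        # Concatenation lets the candidate set vary with each image's OCR tokens,
        # keeping the decoding space small compared with one large static softmax.
        return torch.cat([fixed_scores, ocr_scores], dim=1)

# Usage: scores = decoder(state, ocr_feats); next_token = scores.argmax(dim=1)
```

In this scheme an index below the fixed vocabulary size selects a frequent answer word, while a larger index copies an OCR token from the image, which is one common way to let a text-VQA decoder produce answers containing scene text it has never seen during training.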
Journal Introduction
This leading international journal promotes and stimulates research in the field of artificial intelligence (AI). Covering a wide range of issues - from the tools and languages of AI to its philosophical implications - Computational Intelligence provides a vigorous forum for the publication of both experimental and theoretical research, as well as surveys and impact studies. The journal is designed to meet the needs of a wide range of AI workers in academic and industrial research.