Enhancing scene-text visual question answering with relational reasoning, attention and dynamic vocabulary integration

IF 1.8 · CAS Tier 4 (Computer Science) · JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Mayank Agrawal, Anand Singh Jalal, Himanshu Sharma
{"title":"利用关系推理、注意力和动态词汇整合加强场景-文本视觉问题解答","authors":"Mayank Agrawal,&nbsp;Anand Singh Jalal,&nbsp;Himanshu Sharma","doi":"10.1111/coin.12635","DOIUrl":null,"url":null,"abstract":"<p>Visual question answering (VQA) is a challenging task in computer vision. Recently, there has been a growing interest in text-based VQA tasks, emphasizing the important role of textual information for better understanding of images. Effectively utilizing text information within the image is crucial for achieving success in this task. However, existing approaches often overlook the contextual information and neglect to utilize the relationships between scene-text tokens and image objects. They simply incorporate the scene-text tokens mined from the image into the VQA model without considering these important factors. In this paper, the proposed model initially analyzes the image to extract text and identify scene objects. It then comprehends the question and mines relationships among the question, OCRed text, and scene objects, ultimately generating an answer through relational reasoning by conducting semantic and positional attention. Our decoder with attention map loss enables prediction of complex answers and handles dynamic vocabularies, reducing decoding space. It outperforms softmax-based cross entropy loss in accuracy and efficiency by accommodating varying vocabulary sizes. We evaluated our model's performance on the TextVQA dataset and achieved an accuracy of 53.91% on the validation set and 53.98% on the test set. Moreover, on the ST-VQA dataset, our model obtained ANLS scores of 0.699 on the validation set and 0.692 on the test set.</p>","PeriodicalId":55228,"journal":{"name":"Computational Intelligence","volume":null,"pages":null},"PeriodicalIF":1.8000,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Enhancing scene-text visual question answering with relational reasoning, attention and dynamic vocabulary integration\",\"authors\":\"Mayank Agrawal,&nbsp;Anand Singh Jalal,&nbsp;Himanshu Sharma\",\"doi\":\"10.1111/coin.12635\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Visual question answering (VQA) is a challenging task in computer vision. Recently, there has been a growing interest in text-based VQA tasks, emphasizing the important role of textual information for better understanding of images. Effectively utilizing text information within the image is crucial for achieving success in this task. However, existing approaches often overlook the contextual information and neglect to utilize the relationships between scene-text tokens and image objects. They simply incorporate the scene-text tokens mined from the image into the VQA model without considering these important factors. In this paper, the proposed model initially analyzes the image to extract text and identify scene objects. It then comprehends the question and mines relationships among the question, OCRed text, and scene objects, ultimately generating an answer through relational reasoning by conducting semantic and positional attention. Our decoder with attention map loss enables prediction of complex answers and handles dynamic vocabularies, reducing decoding space. It outperforms softmax-based cross entropy loss in accuracy and efficiency by accommodating varying vocabulary sizes. 
We evaluated our model's performance on the TextVQA dataset and achieved an accuracy of 53.91% on the validation set and 53.98% on the test set. Moreover, on the ST-VQA dataset, our model obtained ANLS scores of 0.699 on the validation set and 0.692 on the test set.</p>\",\"PeriodicalId\":55228,\"journal\":{\"name\":\"Computational Intelligence\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2024-02-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computational Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1111/coin.12635\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/coin.12635","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract


Visual question answering (VQA) is a challenging task in computer vision. Recently, there has been a growing interest in text-based VQA tasks, emphasizing the important role of textual information for better understanding of images. Effectively utilizing text information within the image is crucial for achieving success in this task. However, existing approaches often overlook the contextual information and neglect to utilize the relationships between scene-text tokens and image objects. They simply incorporate the scene-text tokens mined from the image into the VQA model without considering these important factors. In this paper, the proposed model initially analyzes the image to extract text and identify scene objects. It then comprehends the question and mines relationships among the question, OCRed text, and scene objects, ultimately generating an answer through relational reasoning by conducting semantic and positional attention. Our decoder with attention map loss enables prediction of complex answers and handles dynamic vocabularies, reducing decoding space. It outperforms softmax-based cross entropy loss in accuracy and efficiency by accommodating varying vocabulary sizes. We evaluated our model's performance on the TextVQA dataset and achieved an accuracy of 53.91% on the validation set and 53.98% on the test set. Moreover, on the ST-VQA dataset, our model obtained ANLS scores of 0.699 on the validation set and 0.692 on the test set.
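The abstract does not give implementation details, so the following is only a minimal sketch of how a dynamic-vocabulary answer decoder of this kind is commonly built for scene-text VQA: logits over a fixed answer word list are concatenated with pointer scores over the OCR tokens detected in the current image, so the decoding space grows and shrinks with the scene text instead of requiring one large static softmax. All names and sizes here (DynamicVocabDecoder, hidden_dim, fixed_vocab_size) are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: a pointer-style decoder over a dynamic vocabulary
# (fixed answer words + per-image OCR tokens). Not the paper's actual code.
import torch
import torch.nn as nn


class DynamicVocabDecoder(nn.Module):
    def __init__(self, hidden_dim: int, fixed_vocab_size: int):
        super().__init__()
        # Scores over the fixed (static) answer vocabulary.
        self.fixed_head = nn.Linear(hidden_dim, fixed_vocab_size)
        # Projection used to score the decoder state against each OCR token feature.
        self.ocr_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, dec_state: torch.Tensor, ocr_feats: torch.Tensor,
                ocr_mask: torch.Tensor) -> torch.Tensor:
        """
        dec_state: (batch, hidden_dim)            current decoding-step state
        ocr_feats: (batch, num_ocr, hidden_dim)   features of OCR tokens in the image
        ocr_mask:  (batch, num_ocr)               1 for real tokens, 0 for padding
        Returns logits over [fixed vocabulary ; OCR tokens of this image].
        """
        fixed_logits = self.fixed_head(dec_state)                        # (B, V_fixed)
        ocr_logits = torch.einsum("bd,bnd->bn",
                                  self.ocr_proj(dec_state), ocr_feats)   # (B, N_ocr)
        ocr_logits = ocr_logits.masked_fill(ocr_mask == 0, float("-inf"))
        # The effective vocabulary changes per image: V_fixed + N_ocr entries.
        return torch.cat([fixed_logits, ocr_logits], dim=-1)


# Usage: greedy pick over the concatenated (dynamic) vocabulary.
decoder = DynamicVocabDecoder(hidden_dim=768, fixed_vocab_size=5000)
logits = decoder(torch.randn(2, 768), torch.randn(2, 12, 768), torch.ones(2, 12))
answer_token = logits.argmax(dim=-1)  # index < 5000 -> fixed word, else OCR copy
```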

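For reference, ANLS (Average Normalized Levenshtein Similarity), the metric reported above for ST-VQA, is conventionally computed as sketched below. The 0.5 threshold follows the standard ST-VQA challenge definition and is not specific to this paper.

```python
# Standard ANLS computation: per question, take the best similarity against any
# ground-truth answer; similarities below the threshold tau score zero.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def anls(predictions: list[str], gold_answers: list[list[str]], tau: float = 0.5) -> float:
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / max(len(predictions), 1)


# Example: an exact match scores 1.0; small OCR slips still earn partial credit.
print(anls(["coca cola"], [["coca cola", "coca-cola"]]))  # 1.0
```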
Source journal: Computational Intelligence (Engineering & Technology — Computer Science: Artificial Intelligence)
CiteScore: 6.90
Self-citation rate: 3.60%
Articles per year: 65
Review time: >12 weeks
Journal description: This leading international journal promotes and stimulates research in the field of artificial intelligence (AI). Covering a wide range of issues - from the tools and languages of AI to its philosophical implications - Computational Intelligence provides a vigorous forum for the publication of both experimental and theoretical research, as well as surveys and impact studies. The journal is designed to meet the needs of a wide range of AI workers in academic and industrial research.