FastVDT: Fast Transformer With Optimised Attention Masks and Positional Encoding for Visual Dialogue

IF 1.3 Zone 4 (Computer Science) Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IET Computer Vision Pub Date : 2025-05-29 DOI:10.1049/cvi2.70022

Qiangqiang He, Shuwei Qian, Chongjun Wang

IET Computer Vision, volume 19, issue 1. Journal Article.

Full text: https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cvi2.70022

PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70022

Citations: 0

Abstract

The visual dialogue task requires computers to comprehend image content and preceding question-and-answer history to accurately answer related questions, with each round of dialogue providing the necessary historical context for subsequent interactions. Existing research typically processes multiple questions related to a single image as independent samples, which results in redundant modelling of the images and their captions and substantially increases computational costs. To address the challenges above, we introduce a fast transformer for visual dialogue, termed FastVDT, which utilises novel attention masks and continuous positional encoding. FastVDT models multiple image-related questions as an integrated entity, accurately processing prior conversation history in each dialogue round while predicting answers to multiple questions. Our method effectively captures the interrelations among questions and significantly reduces computational overhead. Experimental results demonstrate that our method delivers outstanding performance on the VisDial v0.9 and v1.0 datasets. FastVDT achieves comparable performance to VD-BERT and VU-BERT while reducing computational costs by 80% and 56%, respectively.
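The abstract does not spell out the attention masks themselves, so the following is only a minimal sketch of one pattern consistent with the description: all rounds for an image are packed into a single sequence, and each round may attend to the shared image and caption tokens plus the rounds before it, but not to later rounds. The function names, the packed layout, and the block-causal rule are assumptions made for illustration, not FastVDT's confirmed design. The second function counts input tokens only, to show where the redundancy in the independent-sample setup comes from; it is not a measure of actual compute.

```python
import numpy as np


def build_round_mask(n_ctx, round_lens):
    """Boolean attention mask (True = query may attend to key).

    Assumed layout of the packed sequence (hypothetical):
    [shared image/caption tokens | round 1 tokens | round 2 tokens | ...]
    """
    total = n_ctx + sum(round_lens)
    # Segment id per token: 0 for the shared context, r for dialogue round r.
    seg = np.zeros(total, dtype=int)
    start = n_ctx
    for r, length in enumerate(round_lens, start=1):
        seg[start:start + length] = r
        start += length
    # A query token sees every key whose segment is not later than its own,
    # so each round sees the context, earlier rounds, and itself.
    mask = seg[:, None] >= seg[None, :]
    return mask, seg


def token_counts(n_ctx, round_lens):
    """Input tokens processed: one sample per round vs. one packed sample."""
    independent = 0
    history = 0
    for length in round_lens:
        history += length
        # Each independent sample re-encodes the context plus history so far.
        independent += n_ctx + history
    packed = n_ctx + sum(round_lens)
    return independent, packed


if __name__ == "__main__":
    mask, seg = build_round_mask(n_ctx=2, round_lens=[1, 2])
    print(seg)
    print(mask.astype(int))
    print(token_counts(n_ctx=4, round_lens=[2, 3, 2]))
```

With 4 context tokens and rounds of 2, 3 and 2 tokens, the independent setup feeds 26 tokens through the model while the packed sequence feeds 11, because the context and earlier rounds are encoded once rather than once per round.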




Source journal

IET Computer Vision (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 3.30

Self-citation rate: 11.80%

Review time: 3.4 months

Journal description: IET Computer Vision seeks original research papers in a wide range of areas of computer vision. The journal aims to publish the highest-quality research that is relevant and topical to the field, without overlooking work that opens new horizons and sets the agenda for future research in computer vision.

IET Computer Vision welcomes submissions on the following topics: biologically and perceptually motivated approaches to low level vision (feature detection, etc.); perceptual grouping and organisation; representation, analysis and matching of 2D and 3D shape; shape-from-X; object recognition; image understanding; learning with visual inputs; motion analysis and object tracking; multiview scene analysis; cognitive approaches in low, mid and high level vision; control in visual systems; colour, reflectance and light; statistical and probabilistic models; face and gesture; surveillance; biometrics and security; robotics; vehicle guidance; automatic model acquisition; medical image analysis and understanding; aerial scene analysis and remote sensing; deep learning models in computer vision.

Both methodological and applications orientated papers are welcome. Submitted manuscripts are expected to include a detailed and analytical review of the literature, a state-of-the-art exposition of the proposed research and its methodology, a thorough experimental evaluation, and a comparative evaluation against relevant state-of-the-art methods. Submissions that do not meet these minimum requirements may be returned to authors without being sent for review.

Special Issues, Current Call for Papers:

Computer Vision for Smart Cameras and Camera Networks - https://digital-library.theiet.org/files/IET_CVI_SC.pdf

Computer Vision for the Creative Industries - https://digital-library.theiet.org/files/IET_CVI_CVCI.pdf