VMAN: visual-modified attention network for multimodal paradigms

Xiaoyu Song, Dezhi Han, Chongqing Chen, Xiang Shen, Huafeng Wu

The Visual Computer, published 2024-07-18. DOI: 10.1007/s00371-024-03563-4
Abstract
Owing to its excellent dependency modeling and powerful parallel computing capabilities, the Transformer has become the dominant architecture for vision-language tasks (VLT). However, multimodal VLT such as visual question answering (VQA) and visual grounding (VG) demand strong dependency modeling and comprehension across heterogeneous modalities, and conventional Transformers face challenges during image self-interaction: noise is introduced, cross-modal information interaction is insufficient, and refined visual features are hard to obtain. To address these problems, this paper proposes a universal visual-modified attention network (VMAN). Specifically, VMAN optimizes the Transformer attention mechanism by introducing a visual-modified attention unit that establishes text-visual correspondence before the self-interaction of image information. The modified unit modulates image features to produce more refined query features for subsequent interaction, filtering out noise while enhancing dependency modeling and reasoning capabilities. Furthermore, two modification approaches are designed: a weighted-sum-based approach and a cross-attention-based approach. Finally, we conduct extensive experiments with VMAN on five benchmark datasets across two tasks (VQA and VG). The results show that VMAN achieves 70.99% accuracy on VQA-v2 and reaches 74.41% on RefCOCOg, which involves more complex expressions, demonstrating the soundness and effectiveness of VMAN. The code is available at https://github.com/79song/VMAN.
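The idea described in the abstract — modulating the image queries with text information before image self-attention — can be sketched as follows. This is a minimal single-head NumPy illustration under assumed tensor shapes, not the authors' implementation: the function names, the absence of learned projection matrices, and the mixing weight `alpha` in the weighted-sum variant are all hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(qk^T / sqrt(d)) v.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def visual_modified_self_attention(img, txt, mode="cross", alpha=0.5):
    """Sketch of a visual-modified attention unit.

    img: (N_img, d) image token features
    txt: (N_txt, d) text token features
    The image queries are first modified by the text, then used
    for self-attention over the image tokens.
    """
    if mode == "cross":
        # Cross-attention-based modification: image tokens attend to text.
        modified_q = attention(img, txt, txt)
    else:
        # Weighted-sum-based modification (alpha is a hypothetical weight):
        # blend each image token with a pooled text representation.
        modified_q = alpha * img + (1.0 - alpha) * txt.mean(axis=0, keepdims=True)
    # Image self-interaction driven by the text-modified queries.
    return attention(modified_q, img, img)

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))   # 4 image tokens, dim 8
txt = rng.normal(size=(6, 8))   # 6 text tokens, dim 8
out = visual_modified_self_attention(img, txt, mode="cross")
print(out.shape)  # (4, 8): one refined feature per image token
```

In a real model each of `q`, `k`, `v` would pass through learned linear projections and multiple heads; the point here is only the ordering — text-visual correspondence is established before, not during, the image self-interaction.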