Graph-enhanced visual representations and question-guided dual attention for visual question answering

Abdulganiyu Abdu Yusuf, Chong Feng, Xianling Mao, Yunusa Haruna, Xinyan Li, Ramadhani Ally Duma

Neurocomputing, Volume 614, Article 128850. Published 2024-11-07. DOI: 10.1016/j.neucom.2024.128850
URL: https://www.sciencedirect.com/science/article/pii/S0925231224016217
Abstract
Visual Question Answering (VQA) has advanced significantly in recent years, driven by the application of deep learning to vision-language research. Most current VQA models focus on merging visual and textual features, but they should also model the relationships between different parts of an image and use the question to highlight important features. This study proposes a method to enhance neighboring image region features and learn question-aware visual representations. First, we construct a region graph to represent spatial relationships between objects in the image. Then, a graph convolutional network (GCN) propagates information across neighboring regions, enriching each region's feature representation with contextual information. To capture long-range dependencies, the graph is enhanced with random walk with restart (RWR), enabling multi-hop reasoning across distant regions. Furthermore, a question-aware dual attention mechanism refines region features at both the region and feature levels, ensuring that the model emphasizes the regions most critical for answering the question. The enhanced region representations are then combined with the encoded question to predict an answer. Extensive experiments on VQA benchmarks demonstrate state-of-the-art performance from leveraging regional dependencies and question guidance. The integration of GCNs and random walks on the graph helps capture contextual information and focus visual attention selectively, yielding significant improvements over existing methods on the VQA 1.0 and VQA 2.0 benchmark datasets.
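The abstract names three building blocks: GCN propagation over a region graph, RWR for long-range affinities, and question-guided dual attention. The NumPy sketch below illustrates generic textbook forms of each (the symmetric-normalized GCN update of Kipf and Welling, the standard RWR recurrence, and one plausible reading of region-level plus feature-level attention); it is not the authors' implementation, and all names, dimensions, the restart probability, and the way RWR affinities are fed back into the GCN are illustrative assumptions.

import numpy as np

def normalize_adjacency(A):
    # Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2},
    # as in the standard GCN formulation.
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(H, A_norm, W):
    # One propagation step: aggregate neighboring region features,
    # project with W, apply a ReLU nonlinearity.
    return np.maximum(A_norm @ H @ W, 0.0)

def rwr_scores(A, restart_prob=0.15, tol=1e-6, max_iter=100):
    # Random walk with restart: row i of R converges to the stationary
    # visiting distribution of a walker that restarts at region i,
    # capturing multi-hop (long-range) relationships between regions.
    n = A.shape[0]
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # transition matrix
    R = np.eye(n)
    for _ in range(max_iter):
        R_new = (1 - restart_prob) * R @ P + restart_prob * np.eye(n)
        if np.abs(R_new - R).max() < tol:
            break
        R = R_new
    return R

def question_guided_dual_attention(H, q, Wr, Wf):
    # Illustrative dual attention: a region-level softmax weights whole
    # regions by relevance to the question vector q, while a feature-level
    # sigmoid gate re-weights individual feature dimensions. Wr and Wf
    # are hypothetical learned projections.
    region_logits = H @ (Wr @ q)                       # (n,) per-region relevance
    alpha = np.exp(region_logits - region_logits.max())
    alpha /= alpha.sum()                               # region-level attention
    gate = 1.0 / (1.0 + np.exp(-(Wf @ q)))             # (d,) feature-level gate
    return (alpha[:, None] * H) * gate[None, :]

# Toy example: 5 image regions with 8-dim features and a spatial graph.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))                            # region features
A = (rng.random((5, 5)) > 0.6).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0)         # undirected, no self-loops
W = rng.normal(size=(8, 8)) * 0.1

R = rwr_scores(A)                                      # long-range affinities
H_ctx = gcn_layer(H, normalize_adjacency(R), W)        # context-enriched regions

q = rng.normal(size=6)                                 # encoded question (toy)
Wr = rng.normal(size=(8, 6)) * 0.1
Wf = rng.normal(size=(8, 6)) * 0.1
H_att = question_guided_dual_attention(H_ctx, q, Wr, Wf)
print(H_att.shape)                                     # (5, 8)

One design point worth noting: feeding the RWR matrix R (rather than the raw adjacency A) into the GCN is what lets a single propagation step mix information from distant, multi-hop-connected regions; with A alone, each layer only reaches immediate neighbors.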
About the Journal
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. The journal covers neurocomputing theory, practice, and applications.