{"title":"Dual-level dynamic heterogeneous graph network for video question answering","authors":"Zefan Zhang, Yanhui Li, Weiqi Zhang, Tian Bai","doi":"10.1016/j.neunet.2025.108094","DOIUrl":null,"url":null,"abstract":"<div><div>Recently, Video Question Answering (VideoQA) has garnered considerable research interest as a pivotal task within the realm of vision-language understanding. However, existing Video Question Answering datasets often lack sufficient entity and event information. Thus, the Vision Language Models (VLMs) struggle to complete intricate grounding and reasoning among multi-modal entities or events and heavily rely on language short-cut or irrelevant visual context. To address these challenges, we make improvements from both data and model perspectives. In terms of VideoQA data, we focus on supplementing the missing specific entities and events with the proposed event and entity augmentation strategies. Based on the augmented data, we propose a Dual-Level Dynamic Heterogeneous Graph Network (DDHG) for Video Question Answering. DDHG incorporates transformer layers to capture the dynamic temporal-spatial changes of visual entities. Then, DDHG establishes multi-modal semantic grounding ability between vision and text with entity-level and event-level heterogeneous graphs. Finally, the Dual-level Cross-modal Interaction Module integrates the dual-level features to predict correct answers. Our method not only significantly outperforms existing VideoQA models on two complex event-based benchmark datasets (Causal-VidQA and NExT-QA) but also demonstrates superior event content prediction ability over several state-of-the-art approaches.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"194 ","pages":"Article 108094"},"PeriodicalIF":6.3000,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608025009748","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Recently, Video Question Answering (VideoQA) has garnered considerable research interest as a pivotal task in vision-language understanding. However, existing VideoQA datasets often lack sufficient entity and event information, so Vision-Language Models (VLMs) struggle to perform intricate grounding and reasoning over multi-modal entities and events, relying heavily on language shortcuts or irrelevant visual context. To address these challenges, we make improvements from both the data and the model perspective. On the data side, we supplement the missing entities and events with the proposed event and entity augmentation strategies. On the model side, building on the augmented data, we propose a Dual-Level Dynamic Heterogeneous Graph Network (DDHG) for VideoQA. DDHG incorporates transformer layers to capture the dynamic spatio-temporal changes of visual entities, then establishes multi-modal semantic grounding between vision and text with entity-level and event-level heterogeneous graphs. Finally, a Dual-Level Cross-Modal Interaction Module integrates the dual-level features to predict the correct answer. Our method not only significantly outperforms existing VideoQA models on two complex event-based benchmarks (Causal-VidQA and NExT-QA) but also demonstrates superior event-content prediction compared with several state-of-the-art approaches.
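The abstract describes the DDHG pipeline only at a high level. Below is a minimal PyTorch sketch of that pipeline as we read it: transformer encoding of visual entities, entity-level and event-level cross-modal graphs, and a dual-level fusion step. All names (DDHGSketch, HeteroGraphLayer), feature dimensions, the number of answer candidates, and the dense-affinity message passing are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the DDHG pipeline described in the abstract.
# Module names, shapes, and the graph-propagation scheme are assumptions.
import torch
import torch.nn as nn


class HeteroGraphLayer(nn.Module):
    """One round of message passing over a dense cross-modal affinity graph
    (a stand-in for the paper's heterogeneous graphs)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, D), text: (B, Nt, D)
        scale = visual.size(-1) ** 0.5
        affinity = torch.softmax(visual @ text.transpose(1, 2) / scale, dim=-1)
        grounded = affinity @ text                        # text-grounded visual nodes
        return visual + torch.relu(self.proj(grounded))   # residual update


class DDHGSketch(nn.Module):
    def __init__(self, dim: int = 256, num_answers: int = 5):
        super().__init__()
        # Transformer layers to model temporal-spatial changes of visual entities.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Entity-level and event-level cross-modal graphs (assumed structure).
        self.entity_graph = HeteroGraphLayer(dim)
        self.event_graph = HeteroGraphLayer(dim)
        # Dual-level cross-modal interaction: fuse both levels with the question.
        self.fusion = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, entity_feats, event_feats, question_feats):
        # entity_feats: (B, Nv, D) region-level tracks; event_feats: (B, Ne, D)
        # clip-level segments; question_feats: (B, Nt, D) token embeddings.
        entities = self.temporal_encoder(entity_feats)
        entities = self.entity_graph(entities, question_feats)
        events = self.event_graph(event_feats, question_feats)
        dual = torch.cat([entities, events], dim=1)        # dual-level features
        fused, _ = self.fusion(question_feats, dual, dual)
        return self.classifier(fused.mean(dim=1))          # answer logits


# Toy usage: batch of 2, 8 entity tracks, 4 event segments, 12 question tokens.
model = DDHGSketch()
logits = model(torch.randn(2, 8, 256), torch.randn(2, 4, 256), torch.randn(2, 12, 256))
print(logits.shape)  # torch.Size([2, 5])

The dense question-conditioned affinity matrix here simply grounds every visual node in the question text; the actual heterogeneous graphs presumably distinguish node and edge types (entity vs. event, visual vs. textual), which this sketch collapses for brevity.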
Journal Introduction
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.