{"title":"Mimicking human attention in driving scenarios for enhanced Visual Question Answering: Insights from eye-tracking and the human attention filter","authors":"Kaavya Rekanar , Martin J. Hayes , Ciarán Eising","doi":"10.1016/j.iswa.2025.200578","DOIUrl":null,"url":null,"abstract":"<div><div>Visual Question Answering (VQA) models serve a critical role in interpreting visual data and responding to textual queries, particularly within the domain of autonomous driving. These models enhance situational awareness and enable naturalistic interaction between passengers and vehicle systems. However, existing VQA architectures often underperform in driving contexts due to their generic design and lack of alignment with domain-specific perceptual cues. This study introduces a targeted enhancement strategy based on the integration of human visual attention patterns into VQA systems. The proposed approach investigates visual subjectivity by analysing human responses and gaze behaviours captured through an eye-tracking experiment conducted in a realistic driving scenario. This method enables the direct observation of authentic attention patterns and mitigates the limitations introduced by subjective self-reporting. From these findings, a Human Attention Filter (HAF) is constructed to selectively preserve task-relevant features while suppressing visually distracting but semantically irrelevant content. Three VQA models – LXMERT, ViLBERT, and ViLT – are evaluated to demonstrate the adaptability and impact of HAF across different visual representation strategies, including region-based and patch-based architectures. Case studies involving LXMERT and ViLBERT are conducted to assess the integration of the HAF within region-based multimodal pipelines, showing measurable improvements in performance and alignment with human-like attention. Quantitative analysis reveals statistically significant performance trends correlated with driving experience, highlighting cognitive variability among human participants and informing model interpretability. In addition, failure cases are examined to identify potential limitations introduced by attention filtering, offering critical insight into the boundaries of gaze-guided model alignment.The findings validate the effectiveness of human-informed filtering for improving both accuracy and transparency in autonomous VQA systems, and present HAF as a sustainable, cognitively aligned strategy for advancing trustworthy AI in real-world environments.</div></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"28 ","pages":"Article 200578"},"PeriodicalIF":4.3000,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligent Systems with Applications","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2667305325001048","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Visual Question Answering (VQA) models play a critical role in interpreting visual data and responding to textual queries, particularly within the domain of autonomous driving. These models enhance situational awareness and enable naturalistic interaction between passengers and vehicle systems. However, existing VQA architectures often underperform in driving contexts due to their generic design and lack of alignment with domain-specific perceptual cues. This study introduces a targeted enhancement strategy based on the integration of human visual attention patterns into VQA systems. The proposed approach investigates visual subjectivity by analysing human responses and gaze behaviours captured through an eye-tracking experiment conducted in a realistic driving scenario. This method enables the direct observation of authentic attention patterns and mitigates the limitations introduced by subjective self-reporting. From these findings, a Human Attention Filter (HAF) is constructed to selectively preserve task-relevant features while suppressing visually distracting but semantically irrelevant content. Three VQA models – LXMERT, ViLBERT, and ViLT – are evaluated to demonstrate the adaptability and impact of the HAF across different visual representation strategies, including region-based and patch-based architectures. Case studies involving LXMERT and ViLBERT are conducted to assess the integration of the HAF within region-based multimodal pipelines, showing measurable improvements in performance and alignment with human-like attention. Quantitative analysis reveals statistically significant performance trends correlated with driving experience, highlighting cognitive variability among human participants and informing model interpretability. In addition, failure cases are examined to identify potential limitations introduced by attention filtering, offering critical insight into the boundaries of gaze-guided model alignment. The findings validate the effectiveness of human-informed filtering for improving both accuracy and transparency in autonomous VQA systems, and present the HAF as a sustainable, cognitively aligned strategy for advancing trustworthy AI in real-world environments.
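The abstract describes the HAF only at a high level: gaze data from the eye-tracking experiment is used to keep task-relevant visual features and suppress distracting but irrelevant ones before they reach the VQA encoder. The sketch below is a minimal, hypothetical illustration of that idea for a region-based pipeline such as LXMERT or ViLBERT; the function names (`build_heatmap`, `haf_region_weights`), the Gaussian smoothing sigma, and the keep-threshold are all illustrative assumptions, not the authors' published implementation.

```python
# Hypothetical sketch of a gaze-derived Human Attention Filter (HAF).
# Assumption: fixations are (x, y) pixel coordinates from an eye tracker,
# and `boxes` are detected region proposals feeding a region-based VQA model.
import numpy as np

def build_heatmap(fixations, img_h, img_w, sigma=25.0):
    """Aggregate eye-tracking fixations into a smoothed attention heatmap."""
    heat = np.zeros((img_h, img_w), dtype=np.float32)
    ys, xs = np.mgrid[0:img_h, 0:img_w]
    for fx, fy in fixations:
        heat += np.exp(-((xs - fx) ** 2 + (ys - fy) ** 2) / (2 * sigma ** 2))
    return heat / (heat.max() + 1e-8)  # normalise to [0, 1]

def haf_region_weights(heatmap, boxes, keep_thresh=0.2):
    """Score each region (x1, y1, x2, y2) by the mean gaze attention inside
    it; regions below the threshold are suppressed (weight set to 0)."""
    weights = []
    for x1, y1, x2, y2 in boxes:
        patch = heatmap[int(y1):int(y2), int(x1):int(x2)]
        score = float(patch.mean()) if patch.size else 0.0
        weights.append(score if score >= keep_thresh else 0.0)
    return np.array(weights, dtype=np.float32)

# Usage: down-weight region features before they enter the VQA encoder.
fixations = [(120, 80), (130, 90), (400, 300)]               # gaze samples (px)
boxes = [(100, 60, 160, 120), (380, 280, 420, 320), (0, 0, 50, 50)]
feats = np.random.randn(len(boxes), 768).astype(np.float32)  # region features

heat = build_heatmap(fixations, img_h=480, img_w=640)
w = haf_region_weights(heat, boxes)
filtered_feats = feats * w[:, None]  # gaze-irrelevant regions are zeroed out
```

For a patch-based model such as ViLT, the same mask could plausibly be pooled over the image-patch grid rather than over detected region boxes; the abstract notes only that the HAF adapts to both representation strategies, not how the pooling is done.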