{"title":"Mimicking human attention in driving scenarios for enhanced Visual Question Answering: Insights from eye-tracking and the human attention filter","authors":"Kaavya Rekanar , Martin J. Hayes , Ciarán Eising","doi":"10.1016/j.iswa.2025.200578","DOIUrl":null,"url":null,"abstract":"<div><div>Visual Question Answering (VQA) models serve a critical role in interpreting visual data and responding to textual queries, particularly within the domain of autonomous driving. These models enhance situational awareness and enable naturalistic interaction between passengers and vehicle systems. However, existing VQA architectures often underperform in driving contexts due to their generic design and lack of alignment with domain-specific perceptual cues. This study introduces a targeted enhancement strategy based on the integration of human visual attention patterns into VQA systems. The proposed approach investigates visual subjectivity by analysing human responses and gaze behaviours captured through an eye-tracking experiment conducted in a realistic driving scenario. This method enables the direct observation of authentic attention patterns and mitigates the limitations introduced by subjective self-reporting. From these findings, a Human Attention Filter (HAF) is constructed to selectively preserve task-relevant features while suppressing visually distracting but semantically irrelevant content. Three VQA models – LXMERT, ViLBERT, and ViLT – are evaluated to demonstrate the adaptability and impact of HAF across different visual representation strategies, including region-based and patch-based architectures. Case studies involving LXMERT and ViLBERT are conducted to assess the integration of the HAF within region-based multimodal pipelines, showing measurable improvements in performance and alignment with human-like attention. Quantitative analysis reveals statistically significant performance trends correlated with driving experience, highlighting cognitive variability among human participants and informing model interpretability. In addition, failure cases are examined to identify potential limitations introduced by attention filtering, offering critical insight into the boundaries of gaze-guided model alignment.The findings validate the effectiveness of human-informed filtering for improving both accuracy and transparency in autonomous VQA systems, and present HAF as a sustainable, cognitively aligned strategy for advancing trustworthy AI in real-world environments.</div></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"28 ","pages":"Article 200578"},"PeriodicalIF":4.3000,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligent Systems with Applications","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2667305325001048","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Visual Question Answering (VQA) models play a critical role in interpreting visual data and responding to textual queries, particularly within the domain of autonomous driving. These models enhance situational awareness and enable naturalistic interaction between passengers and vehicle systems. However, existing VQA architectures often underperform in driving contexts due to their generic design and lack of alignment with domain-specific perceptual cues. This study introduces a targeted enhancement strategy based on the integration of human visual attention patterns into VQA systems. The proposed approach investigates visual subjectivity by analysing human responses and gaze behaviours captured through an eye-tracking experiment conducted in a realistic driving scenario. This method enables the direct observation of authentic attention patterns and mitigates the limitations introduced by subjective self-reporting. From these findings, a Human Attention Filter (HAF) is constructed to selectively preserve task-relevant features while suppressing visually distracting but semantically irrelevant content. Three VQA models – LXMERT, ViLBERT, and ViLT – are evaluated to demonstrate the adaptability and impact of the HAF across different visual representation strategies, including region-based and patch-based architectures. Case studies involving LXMERT and ViLBERT are conducted to assess the integration of the HAF within region-based multimodal pipelines, showing measurable improvements in performance and alignment with human-like attention. Quantitative analysis reveals statistically significant performance trends correlated with driving experience, highlighting cognitive variability among human participants and informing model interpretability. In addition, failure cases are examined to identify potential limitations introduced by attention filtering, offering critical insight into the boundaries of gaze-guided model alignment. The findings validate the effectiveness of human-informed filtering for improving both accuracy and transparency in autonomous VQA systems, and present the HAF as a sustainable, cognitively aligned strategy for advancing trustworthy AI in real-world environments.
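The abstract describes the HAF only at a high level: gaze data from the eye-tracking experiment is used to keep task-relevant visual features and suppress distracting but irrelevant ones before they reach the VQA encoder. The sketch below is a minimal, hypothetical illustration of that idea for a region-based pipeline such as LXMERT or ViLBERT; the function names (`build_heatmap`, `haf_region_weights`), the Gaussian smoothing sigma, and the keep-threshold are all illustrative assumptions, not the authors' published implementation.

```python
# Hypothetical sketch of a gaze-derived Human Attention Filter (HAF).
# Assumption: fixations are (x, y) pixel coordinates from an eye tracker,
# and `boxes` are detected region proposals feeding a region-based VQA model.
import numpy as np

def build_heatmap(fixations, img_h, img_w, sigma=25.0):
    """Aggregate eye-tracking fixations into a smoothed attention heatmap."""
    heat = np.zeros((img_h, img_w), dtype=np.float32)
    ys, xs = np.mgrid[0:img_h, 0:img_w]
    for fx, fy in fixations:
        heat += np.exp(-((xs - fx) ** 2 + (ys - fy) ** 2) / (2 * sigma ** 2))
    return heat / (heat.max() + 1e-8)  # normalise to [0, 1]

def haf_region_weights(heatmap, boxes, keep_thresh=0.2):
    """Score each region (x1, y1, x2, y2) by the mean gaze attention inside
    it; regions below the threshold are suppressed (weight set to 0)."""
    weights = []
    for x1, y1, x2, y2 in boxes:
        patch = heatmap[int(y1):int(y2), int(x1):int(x2)]
        score = float(patch.mean()) if patch.size else 0.0
        weights.append(score if score >= keep_thresh else 0.0)
    return np.array(weights, dtype=np.float32)

# Usage: down-weight region features before they enter the VQA encoder.
fixations = [(120, 80), (130, 90), (400, 300)]               # gaze samples (px)
boxes = [(100, 60, 160, 120), (380, 280, 420, 320), (0, 0, 50, 50)]
feats = np.random.randn(len(boxes), 768).astype(np.float32)  # region features

heat = build_heatmap(fixations, img_h=480, img_w=640)
w = haf_region_weights(heat, boxes)
filtered_feats = feats * w[:, None]  # gaze-irrelevant regions are zeroed out
```

For a patch-based model such as ViLT, the same mask could plausibly be pooled over the image-patch grid rather than over detected region boxes; the abstract notes only that the HAF adapts to both representation strategies, not how the pooling is done.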