Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors

Dario Zanca;Andrea Zugarini;Simon Dietz;Thomas R. Altstidl;Mark A. Turban Ndjeuha;Moumita Chakraborty;Naga Venkata Sai Jitin Jami;Leo Schwinn;Bjoern M. Eskofier
{"title":"Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors","authors":"Dario Zanca;Andrea Zugarini;Simon Dietz;Thomas R. Altstidl;Mark A. Turban Ndjeuha;Moumita Chakraborty;Naga Venkata Sai Jitin Jami;Leo Schwinn;Bjoern M. Eskofier","doi":"10.1109/TAI.2025.3612905","DOIUrl":null,"url":null,"abstract":"Understanding human attention mechanisms is crucial for advancing both vision science and artificial intelligence. While numerous computational models of free-viewing have been proposed, less is known about the mechanisms underlying task-driven image exploration. To address this gap, we introduce NevaClip, a novel zero-shot method for predicting visual scanpaths. NevaClip leverages contrastive language-image pretrained (CLIP) models in conjunction with human-inspired neural visual attention (NeVA) algorithms. By aligning the representation of foveated visual stimuli with associated captions, NevaClip uses gradient-driven visual exploration to generate scanpaths that simulate human attention. We also present CapMIT1003, a new dataset comprising captions and click-contingent image explorations collected from participants engaged in a captioning task. Based on the established MIT1003 benchmark, which includes eye-tracking data from free-viewing conditions, CapMIT1003 provides a valuable resource for studying human attention across both free-viewing and task-driven contexts. Additionally, we demonstrate NevaClip’s performance on the publicly available AiR-D dataset, which includes visual question answering (VQA) tasks. Experimental results show that NevaClip outperforms existing unsupervised computational models in scanpath plausibility across captioning, VQA, and free-viewing tasks. Furthermore, we demonstrate that NevaClip’s performance is sensitive to caption accuracy, with misleading captions leading to inaccurate scanpath behaviors. This underscores the importance of caption guidance in attention prediction and highlights NevaClip’s potential to advance our understanding of task-driven human attention mechanisms. Together, NevaClip and CapMIT1003 offer significant contributions to the field, providing new tools for studying and simulating human visual attention.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 4","pages":"2157-2170"},"PeriodicalIF":0.0000,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on artificial intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11199898/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/10/10 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Understanding human attention mechanisms is crucial for advancing both vision science and artificial intelligence. While numerous computational models of free-viewing have been proposed, less is known about the mechanisms underlying task-driven image exploration. To address this gap, we introduce NevaClip, a novel zero-shot method for predicting visual scanpaths. NevaClip leverages contrastive language-image pretrained (CLIP) models in conjunction with human-inspired neural visual attention (NeVA) algorithms. By aligning the representation of foveated visual stimuli with associated captions, NevaClip uses gradient-driven visual exploration to generate scanpaths that simulate human attention. We also present CapMIT1003, a new dataset comprising captions and click-contingent image explorations collected from participants engaged in a captioning task. Based on the established MIT1003 benchmark, which includes eye-tracking data from free-viewing conditions, CapMIT1003 provides a valuable resource for studying human attention across both free-viewing and task-driven contexts. Additionally, we demonstrate NevaClip’s performance on the publicly available AiR-D dataset, which includes visual question answering (VQA) tasks. Experimental results show that NevaClip outperforms existing unsupervised computational models in scanpath plausibility across captioning, VQA, and free-viewing tasks. Furthermore, we demonstrate that NevaClip’s performance is sensitive to caption accuracy, with misleading captions leading to inaccurate scanpath behaviors. This underscores the importance of caption guidance in attention prediction and highlights NevaClip’s potential to advance our understanding of task-driven human attention mechanisms. Together, NevaClip and CapMIT1003 offer significant contributions to the field, providing new tools for studying and simulating human visual attention.
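The abstract describes gradient-driven visual exploration guided by aligning CLIP embeddings of foveated stimuli with caption embeddings. The sketch below illustrates that idea only; it assumes the openai/CLIP package, and the Gaussian-blending foveation model, the greedy per-fixation loop, and all hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the NevaClip idea (illustrative, not the authors' code):
# optimise a fixation location so that the CLIP embedding of the *foveated*
# image aligns with the CLIP embedding of the caption.
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF
import clip  # https://github.com/openai/CLIP

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

def foveate(image, fixation, sigma=0.15):
    """Blend a sharp and a blurred copy of `image` (1x3xHxW, CLIP-preprocessed)
    with a Gaussian mask centred at `fixation` (normalised x, y in [0, 1])."""
    _, _, h, w = image.shape
    ys = torch.linspace(0, 1, h, device=image.device).view(1, 1, h, 1)
    xs = torch.linspace(0, 1, w, device=image.device).view(1, 1, 1, w)
    mask = torch.exp(-((xs - fixation[0]) ** 2 + (ys - fixation[1]) ** 2)
                     / (2 * sigma ** 2))
    blurred = TF.gaussian_blur(image, kernel_size=21, sigma=8.0)
    return mask * image + (1 - mask) * blurred

def predict_scanpath(image, caption, n_fixations=5, steps=20, lr=0.05):
    """Greedy gradient-driven exploration: select one fixation at a time by
    maximising cosine similarity between foveated-image and caption embeddings."""
    with torch.no_grad():
        text_emb = F.normalize(
            model.encode_text(clip.tokenize([caption]).to(device)), dim=-1)
    scanpath = []
    fixation = torch.tensor([0.5, 0.5], requires_grad=True)  # start at the centre
    for _ in range(n_fixations):
        optimizer = torch.optim.Adam([fixation], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            img_emb = F.normalize(model.encode_image(foveate(image, fixation)), dim=-1)
            loss = -(img_emb * text_emb).sum()  # maximise cosine similarity
            loss.backward()
            optimizer.step()
            with torch.no_grad():
                fixation.clamp_(0.0, 1.0)
        scanpath.append(fixation.detach().clone())
        # Re-initialise for the next fixation; inhibition-of-return and
        # fixation memory used by NeVA-style models are omitted here.
        fixation = torch.tensor(torch.rand(2).tolist(), requires_grad=True)
    return torch.stack(scanpath)

# Example usage (hypothetical file name):
# from PIL import Image
# image = preprocess(Image.open("img.jpg")).unsqueeze(0).to(device)
# path = predict_scanpath(image, "a dog playing with a ball")
```

Optimising fixation coordinates directly keeps the whole pipeline differentiable, which is what makes the approach zero-shot: no scanpath data are needed, only a pretrained CLIP model and a caption.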