{"title":"看到什么是重要的:在vr模拟的驾驶事故预测中,人类和人工智能之间的注意力(错误)一致。","authors":"Hoe Sung Ryu, Uijong Ju, Christian Wallraven","doi":"10.1109/TVCG.2025.3616811","DOIUrl":null,"url":null,"abstract":"<p><p>This study explores how human and AI visual attention differ in a short-term prediction task, particularly in the moments before an accident is about to happen. Since real-world studies of this kind would pose ethical and safety risks, we employed virtual reality (VR) to simulate an accident scenario. In the scenario, the driver approaches a fork in the road, knowing that one path would lead off a cliff crashing the car fatally-as the fork comes closer, the other, safe, path is suddenly blocked by trees, forcing the driver to make a split-second decision where to go. A total of $N = 71$ drivers completed the task, and we asked another $N = 30$ observers to watch short video clips leading up to the final event and to predict which way the driver would take. We then compared both prediction accuracy as well as attention patterns-how focus is distributed across objects-with AI systems, including vision language models (VLMs) and vision-only models. We found that overall, prediction performance increased as the accident time point approached; interestingly, humans fared better than AI systems overall except for the final time period just before the event. We also found that humans adapted their attention dynamically, shifting focus to important scene elements before an event, whereas AI attention remained static, overlooking key details of the scene. Importantly, as the accident time point approached, human-AI attentional alignment decreased, even though both types of models improved in prediction accuracy. Despite distinct temporal trajectories-vision-only models declining from an early advantage and VLMs peaking in the middle-both models achieved low to zero alignment with human attention. These findings highlight a critical dissociation: AI models make accurate predictions, but rely on visual strategies diverging from human processing, underscoring a gap between explainability and task performance.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":""},"PeriodicalIF":6.5000,"publicationDate":"2025-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Seeing What Matters: Attentional (Mis-)Alignment Between Humans and AI in VR-Simulated Prediction of Driving Accidents.\",\"authors\":\"Hoe Sung Ryu, Uijong Ju, Christian Wallraven\",\"doi\":\"10.1109/TVCG.2025.3616811\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>This study explores how human and AI visual attention differ in a short-term prediction task, particularly in the moments before an accident is about to happen. Since real-world studies of this kind would pose ethical and safety risks, we employed virtual reality (VR) to simulate an accident scenario. In the scenario, the driver approaches a fork in the road, knowing that one path would lead off a cliff crashing the car fatally-as the fork comes closer, the other, safe, path is suddenly blocked by trees, forcing the driver to make a split-second decision where to go. A total of $N = 71$ drivers completed the task, and we asked another $N = 30$ observers to watch short video clips leading up to the final event and to predict which way the driver would take. 
We then compared both prediction accuracy as well as attention patterns-how focus is distributed across objects-with AI systems, including vision language models (VLMs) and vision-only models. We found that overall, prediction performance increased as the accident time point approached; interestingly, humans fared better than AI systems overall except for the final time period just before the event. We also found that humans adapted their attention dynamically, shifting focus to important scene elements before an event, whereas AI attention remained static, overlooking key details of the scene. Importantly, as the accident time point approached, human-AI attentional alignment decreased, even though both types of models improved in prediction accuracy. Despite distinct temporal trajectories-vision-only models declining from an early advantage and VLMs peaking in the middle-both models achieved low to zero alignment with human attention. These findings highlight a critical dissociation: AI models make accurate predictions, but rely on visual strategies diverging from human processing, underscoring a gap between explainability and task performance.</p>\",\"PeriodicalId\":94035,\"journal\":{\"name\":\"IEEE transactions on visualization and computer graphics\",\"volume\":\"PP \",\"pages\":\"\"},\"PeriodicalIF\":6.5000,\"publicationDate\":\"2025-10-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on visualization and computer graphics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TVCG.2025.3616811\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on visualization and computer graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TVCG.2025.3616811","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Seeing What Matters: Attentional (Mis-)Alignment Between Humans and AI in VR-Simulated Prediction of Driving Accidents.
This study explores how human and AI visual attention differ in a short-term prediction task, particularly in the moments just before an accident occurs. Since real-world studies of this kind would pose ethical and safety risks, we employed virtual reality (VR) to simulate an accident scenario. In the scenario, the driver approaches a fork in the road, knowing that one path leads off a cliff and would crash the car fatally. As the fork comes closer, the other, safe path is suddenly blocked by trees, forcing the driver to make a split-second decision about where to go. A total of N = 71 drivers completed the task, and we asked another N = 30 observers to watch short video clips leading up to the final event and to predict which path the driver would take. We then compared both prediction accuracy and attention patterns (how focus is distributed across objects) between humans and AI systems, including vision-language models (VLMs) and vision-only models. We found that, overall, prediction performance increased as the accident time point approached; interestingly, humans fared better than AI systems except in the final time period just before the event. We also found that humans adapted their attention dynamically, shifting focus to important scene elements before an event, whereas AI attention remained static, overlooking key details of the scene. Importantly, as the accident time point approached, human-AI attentional alignment decreased, even though both types of models improved in prediction accuracy. Despite distinct temporal trajectories, with vision-only models declining from an early advantage and VLMs peaking in the middle, both model types achieved low to zero alignment with human attention. These findings highlight a critical dissociation: AI models make accurate predictions but rely on visual strategies that diverge from human processing, underscoring a gap between explainability and task performance.
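The abstract does not specify how attentional alignment between humans and AI was quantified. Purely as an illustrative sketch, one simple way to compare a human and a model attention distribution over scene objects is to normalize per-object attention weights and compute their cosine similarity; the object labels, weights, and metric below are assumptions for illustration, not the authors' method.

```python
import numpy as np

def normalize(weights):
    """Turn raw per-object attention weights into a probability distribution."""
    w = np.asarray(weights, dtype=float)
    total = w.sum()
    return w / total if total > 0 else np.full_like(w, 1.0 / len(w))

def attentional_alignment(human_attention, model_attention):
    """Cosine similarity between two per-object attention distributions
    (1.0 = identical focus, 0.0 = no shared focus). Object order must match."""
    h = normalize(human_attention)
    m = normalize(model_attention)
    return float(np.dot(h, m) / (np.linalg.norm(h) * np.linalg.norm(m)))

# Hypothetical example over objects [fork, trees blocking safe path, cliff edge, dashboard]:
human = [0.50, 0.30, 0.15, 0.05]   # humans shift focus toward decision-relevant elements
model = [0.10, 0.05, 0.05, 0.80]   # a static model dwelling on a less relevant region
print(f"alignment = {attentional_alignment(human, model):.2f}")
```

Under these assumed weights, the score is low, matching the kind of human-AI attentional misalignment the abstract reports near the accident time point.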