{"title":"看到什么是重要的:在vr模拟的驾驶事故预测中,人类和人工智能之间的注意力(错误)一致。","authors":"Hoe Sung Ryu, Uijong Ju, Christian Wallraven","doi":"10.1109/TVCG.2025.3616811","DOIUrl":null,"url":null,"abstract":"<p><p>This study explores how human and AI visual attention differ in a short-term prediction task, particularly in the moments before an accident is about to happen. Since real-world studies of this kind would pose ethical and safety risks, we employed virtual reality (VR) to simulate an accident scenario. In the scenario, the driver approaches a fork in the road, knowing that one path would lead off a cliff crashing the car fatally-as the fork comes closer, the other, safe, path is suddenly blocked by trees, forcing the driver to make a split-second decision where to go. A total of $N = 71$ drivers completed the task, and we asked another $N = 30$ observers to watch short video clips leading up to the final event and to predict which way the driver would take. We then compared both prediction accuracy as well as attention patterns-how focus is distributed across objects-with AI systems, including vision language models (VLMs) and vision-only models. We found that overall, prediction performance increased as the accident time point approached; interestingly, humans fared better than AI systems overall except for the final time period just before the event. We also found that humans adapted their attention dynamically, shifting focus to important scene elements before an event, whereas AI attention remained static, overlooking key details of the scene. Importantly, as the accident time point approached, human-AI attentional alignment decreased, even though both types of models improved in prediction accuracy. Despite distinct temporal trajectories-vision-only models declining from an early advantage and VLMs peaking in the middle-both models achieved low to zero alignment with human attention. These findings highlight a critical dissociation: AI models make accurate predictions, but rely on visual strategies diverging from human processing, underscoring a gap between explainability and task performance.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":""},"PeriodicalIF":6.5000,"publicationDate":"2025-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Seeing What Matters: Attentional (Mis-)Alignment Between Humans and AI in VR-Simulated Prediction of Driving Accidents.\",\"authors\":\"Hoe Sung Ryu, Uijong Ju, Christian Wallraven\",\"doi\":\"10.1109/TVCG.2025.3616811\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>This study explores how human and AI visual attention differ in a short-term prediction task, particularly in the moments before an accident is about to happen. Since real-world studies of this kind would pose ethical and safety risks, we employed virtual reality (VR) to simulate an accident scenario. In the scenario, the driver approaches a fork in the road, knowing that one path would lead off a cliff crashing the car fatally-as the fork comes closer, the other, safe, path is suddenly blocked by trees, forcing the driver to make a split-second decision where to go. A total of $N = 71$ drivers completed the task, and we asked another $N = 30$ observers to watch short video clips leading up to the final event and to predict which way the driver would take. 
We then compared both prediction accuracy as well as attention patterns-how focus is distributed across objects-with AI systems, including vision language models (VLMs) and vision-only models. We found that overall, prediction performance increased as the accident time point approached; interestingly, humans fared better than AI systems overall except for the final time period just before the event. We also found that humans adapted their attention dynamically, shifting focus to important scene elements before an event, whereas AI attention remained static, overlooking key details of the scene. Importantly, as the accident time point approached, human-AI attentional alignment decreased, even though both types of models improved in prediction accuracy. Despite distinct temporal trajectories-vision-only models declining from an early advantage and VLMs peaking in the middle-both models achieved low to zero alignment with human attention. These findings highlight a critical dissociation: AI models make accurate predictions, but rely on visual strategies diverging from human processing, underscoring a gap between explainability and task performance.</p>\",\"PeriodicalId\":94035,\"journal\":{\"name\":\"IEEE transactions on visualization and computer graphics\",\"volume\":\"PP \",\"pages\":\"\"},\"PeriodicalIF\":6.5000,\"publicationDate\":\"2025-10-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on visualization and computer graphics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TVCG.2025.3616811\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on visualization and computer graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TVCG.2025.3616811","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Seeing What Matters: Attentional (Mis-)Alignment Between Humans and AI in VR-Simulated Prediction of Driving Accidents.
This study explores how human and AI visual attention differ in a short-term prediction task, particularly in the moments just before an accident occurs. Since real-world studies of this kind would pose ethical and safety risks, we employed virtual reality (VR) to simulate an accident scenario. In the scenario, the driver approaches a fork in the road, knowing that one path leads off a cliff and would crash the car fatally. As the fork comes closer, the other, safe path is suddenly blocked by trees, forcing the driver to make a split-second decision about where to go. A total of N = 71 drivers completed the task, and we asked another N = 30 observers to watch short video clips leading up to the final event and to predict which path the driver would take. We then compared both prediction accuracy and attention patterns (how focus is distributed across objects) between humans and AI systems, including vision-language models (VLMs) and vision-only models. We found that, overall, prediction performance increased as the accident time point approached; interestingly, humans fared better than AI systems except in the final time period just before the event. We also found that humans adapted their attention dynamically, shifting focus to important scene elements before an event, whereas AI attention remained static, overlooking key details of the scene. Importantly, as the accident time point approached, human-AI attentional alignment decreased, even though both types of models improved in prediction accuracy. Despite distinct temporal trajectories, with vision-only models declining from an early advantage and VLMs peaking in the middle, both model types achieved low to zero alignment with human attention. These findings highlight a critical dissociation: AI models make accurate predictions but rely on visual strategies that diverge from human processing, underscoring a gap between explainability and task performance.
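The abstract does not specify how attentional alignment between humans and AI was quantified. Purely as an illustrative sketch, one simple way to compare a human and a model attention distribution over scene objects is to normalize per-object attention weights and compute their cosine similarity; the object labels, weights, and metric below are assumptions for illustration, not the authors' method.

```python
import numpy as np

def normalize(weights):
    """Turn raw per-object attention weights into a probability distribution."""
    w = np.asarray(weights, dtype=float)
    total = w.sum()
    return w / total if total > 0 else np.full_like(w, 1.0 / len(w))

def attentional_alignment(human_attention, model_attention):
    """Cosine similarity between two per-object attention distributions
    (1.0 = identical focus, 0.0 = no shared focus). Object order must match."""
    h = normalize(human_attention)
    m = normalize(model_attention)
    return float(np.dot(h, m) / (np.linalg.norm(h) * np.linalg.norm(m)))

# Hypothetical example over objects [fork, trees blocking safe path, cliff edge, dashboard]:
human = [0.50, 0.30, 0.15, 0.05]   # humans shift focus toward decision-relevant elements
model = [0.10, 0.05, 0.05, 0.80]   # a static model dwelling on a less relevant region
print(f"alignment = {attentional_alignment(human, model):.2f}")
```

Under these assumed weights, the score is low, matching the kind of human-AI attentional misalignment the abstract reports near the accident time point.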