ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality.

IEEE Transactions on Visualization and Computer Graphics. Pub Date: 2025-03-12. DOI: 10.1109/TVCG.2025.3549147

Yanming Xiu, Tim Scargill, Maria Gorlatova



Abstract

In Augmented Reality (AR), virtual content enhances user experience by providing additional information. However, improperly positioned or designed virtual content can be detrimental to task performance, as it can impair users' ability to accurately interpret real-world information. In this paper, we examine two types of task-detrimental virtual content: obstruction attacks, in which virtual content prevents users from seeing real-world objects, and information manipulation attacks, in which virtual content interferes with users' ability to accurately interpret real-world information. We provide a mathematical framework to characterize these attacks and create a custom open-source dataset for attack evaluation. To address these attacks, we introduce ViDDAR (Vision language model-based Task-Detrimental content Detector for Augmented Reality), a comprehensive full-reference system that leverages Vision Language Models (VLMs) and advanced deep learning techniques to monitor and evaluate virtual content in AR environments, employing a user-edge-cloud architecture to balance performance with low latency. To the best of our knowledge, ViDDAR is the first system to employ VLMs for detecting task-detrimental content in AR settings. Our evaluation results demonstrate that ViDDAR effectively understands complex scenes and detects task-detrimental content, achieving up to 92.15% obstruction detection accuracy with a detection latency of 533 ms, and an 82.46% information manipulation content detection accuracy with a latency of 9.62 s.
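The abstract does not reproduce the paper's mathematical framework, so the following is only a minimal sketch of what an obstruction check could look like, not ViDDAR's actual method. It assumes that a real-world object and a piece of virtual content are each described by an axis-aligned bounding box in image coordinates, and that content counts as obstructing when it covers at least some fraction of the object. The names `Box`, `occluded_fraction`, `is_obstruction`, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixels (hypothetical representation)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Degenerate or inverted boxes are treated as having zero area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def occluded_fraction(real: Box, virtual: Box) -> float:
    """Fraction of the real object's box that the virtual content covers."""
    overlap = Box(
        max(real.x1, virtual.x1),
        max(real.y1, virtual.y1),
        min(real.x2, virtual.x2),
        min(real.y2, virtual.y2),
    )
    real_area = real.area()
    return overlap.area() / real_area if real_area > 0 else 0.0


def is_obstruction(real: Box, virtual: Box, threshold: float = 0.5) -> bool:
    """Flag the virtual content when it hides at least `threshold` of the object.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return occluded_fraction(real, virtual) >= threshold


if __name__ == "__main__":
    stop_sign = Box(0, 0, 10, 10)
    overlay = Box(5, 0, 15, 10)
    print(occluded_fraction(stop_sign, overlay))
    print(is_obstruction(stop_sign, overlay))
```

A geometric overlap test like this captures only whether content hides an object. Deciding whether the hidden object matters for the user's task is the part the abstract attributes to the VLM, and it is not modeled here.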
