Context-Dependent Interactable Graphical User Interface Element Detection for VR Applications

arXiv - CS - Software Engineering | Pub Date: 2024-09-17 | DOI: arxiv-2409.10811

Shuqing Li, Binchang Li, Yepang Liu, Cuiyun Gao, Jianping Zhang, Shing-Chi Cheung, Michael R. Lyu



Abstract

In recent years, Virtual Reality (VR) has emerged as a transformative technology, offering users immersive and interactive experiences across diverse virtual environments. Users interact with VR apps through interactable GUI elements (IGEs) on the stereoscopic three-dimensional (3D) graphical user interface (GUI). Accurate recognition of these IGEs is the foundation of many software engineering tasks, including automated testing and effective GUI search. The most recent IGE detection approaches for 2D mobile apps typically train a supervised object detection model on a large-scale, manually labeled GUI dataset, usually with a pre-defined set of clickable GUI element categories such as buttons and spinners. Such approaches can hardly be applied to IGE detection in VR apps, for three reasons: IGE categories are open-vocabulary and heterogeneous, interactability is context-sensitive, and accurate detection requires precise spatial perception and visual-semantic alignment. IGE research tailored to VR apps is therefore needed. In this paper, we propose the first zero-shot cOntext-sensitive inteRactable GUI ElemeNT dEtection framework for virtual Reality apps, named Orienter. Imitating human behavior, Orienter first observes and understands the semantic context of a VR app scene and only then performs detection. The detection process is iterated within a feedback-directed validation and reflection loop. Specifically, Orienter contains three components: (1) semantic context comprehension, (2) reflection-directed IGE candidate detection, and (3) context-sensitive interactability classification. Extensive experiments on the dataset demonstrate that Orienter is more effective than the state-of-the-art GUI element detection approaches.
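The three components and the validation-and-reflection loop named in the abstract can be pictured with a small control-flow sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how each stage is realized, so every name here (`detect_iges`, `Candidate`, the `toy_*` stubs, the `INTERACTABLE_BY_CONTEXT` table, the `max_rounds` parameter) is hypothetical, and the stubs are simple rules standing in for whatever models Orienter actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); a hypothetical layout


@dataclass(frozen=True)
class Candidate:
    label: str
    box: Box


def detect_iges(
    scene: dict,
    comprehend: Callable[[dict], str],
    propose: Callable[[dict, str, List[str]], List[Candidate]],
    validate: Callable[[dict, Candidate], Optional[str]],
    classify: Callable[[str, Candidate], bool],
    max_rounds: int = 3,  # assumed cap on reflection rounds
) -> List[Candidate]:
    # (1) Semantic context comprehension: understand the scene first.
    context = comprehend(scene)

    # (2) Reflection-directed candidate detection: propose, validate,
    # and feed validation problems back into the next proposal round.
    feedback: List[str] = []
    candidates: List[Candidate] = []
    for _ in range(max_rounds):
        candidates = propose(scene, context, feedback)
        problems = [m for c in candidates if (m := validate(scene, c)) is not None]
        if not problems:
            break
        feedback.extend(problems)
    # Drop anything still invalid if the round budget ran out.
    candidates = [c for c in candidates if validate(scene, c) is None]

    # (3) Context-sensitive interactability classification.
    return [c for c in candidates if classify(context, c)]


# --- Toy stand-ins (hypothetical rules, not real models) ---
INTERACTABLE_BY_CONTEXT: Dict[str, Set[str]] = {
    "main menu": {"start button"},
    "cave level": {"torch"},
}


def toy_comprehend(scene: dict) -> str:
    return scene["description"]


def toy_propose(scene: dict, context: str, feedback: List[str]) -> List[Candidate]:
    rejected = set(feedback)  # labels flagged in earlier rounds
    return [c for c in scene["objects"] if c.label not in rejected]


def toy_validate(scene: dict, cand: Candidate) -> Optional[str]:
    x, y, w, h = cand.box
    width, height = scene["size"]
    inside = x >= 0 and y >= 0 and x + w <= width and y + h <= height
    return None if inside else cand.label  # report the offending label


def toy_classify(context: str, cand: Candidate) -> bool:
    return cand.label in INTERACTABLE_BY_CONTEXT.get(context, set())


scene = {
    "description": "main menu",
    "size": (100, 100),
    "objects": [
        Candidate("start button", (10, 10, 20, 10)),
        Candidate("torch", (50, 50, 10, 10)),
        Candidate("ghost panel", (95, 95, 20, 20)),  # extends past the frame
    ],
}
stages = (toy_comprehend, toy_propose, toy_validate, toy_classify)
menu_labels = [c.label for c in detect_iges(scene, *stages)]
cave_labels = [c.label for c in detect_iges(dict(scene, description="cave level"), *stages)]
print(menu_labels)  # ['start button']
print(cave_labels)  # ['torch']
```

The same objects yield different results under different scene descriptions, which is the point of classifying interactability against context rather than against a fixed category list. The out-of-frame candidate is rejected in the first round and is not proposed again in the second.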

