Context-Dependent Interactable Graphical User Interface Element Detection for VR Applications
Shuqing Li, Binchang Li, Yepang Liu, Cuiyun Gao, Jianping Zhang, Shing-Chi Cheung, Michael R. Lyu
arXiv - CS - Software Engineering, 2024-09-17 (arxiv-2409.10811)
Abstract
In recent years, Virtual Reality (VR) has emerged as a transformative
technology, offering users immersive and interactive experiences across
diverse virtual environments. Users can interact with VR apps through
interactable GUI elements (IGEs) on the stereoscopic three-dimensional (3D)
graphical user interface (GUI). Accurate recognition of these IGEs is
essential, as it underpins many software engineering tasks,
including automated testing and effective GUI search. The most recent IGE
detection approaches for 2D mobile apps typically train a supervised object
detection model on a large-scale, manually labeled GUI dataset, usually
with a pre-defined set of clickable GUI element categories like buttons and
spinners. Such approaches can hardly be applied to IGE detection in VR apps
because of several challenges: IGE categories are open-vocabulary and
heterogeneous, interactability is context-sensitive, and accurate detection
requires precise spatial perception and visual-semantic alignment.
IGE detection research tailored to VR apps is therefore needed. In
this paper, we propose the first zero-shot cOntext-sensitive inteRactable GUI
ElemeNT dEtection framework for virtual Reality apps, named Orienter.
Imitating human behavior, Orienter first observes and understands the semantic
context of a VR app scene and only then performs detection. The detection
process iterates within a feedback-directed validation and reflection loop.
Specifically, Orienter comprises three components: (1) semantic
context comprehension, (2) reflection-directed IGE candidate detection, and (3)
context-sensitive interactability classification. Extensive experiments on the
dataset demonstrate that Orienter is more effective than state-of-the-art
GUI element detection approaches.
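
The three components and the reflection loop described in the abstract could be wired together roughly as follows. This is a minimal sketch of the control flow only, not the authors' implementation: the names `detect_iges`, `Candidate`, `comprehend`, `propose`, `validate`, `classify`, and `max_rounds` are all hypothetical, and the `toy_*` functions are hand-written stand-ins for whatever models the real framework uses.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


@dataclass(frozen=True)
class Candidate:
    """A proposed GUI element (hypothetical structure, for illustration)."""
    label: str
    box: Box


def detect_iges(scene, comprehend, propose, validate, classify,
                max_rounds: int = 3) -> List[Candidate]:
    """Assumed control flow: comprehend, then propose/validate, then classify."""
    # (1) Semantic context comprehension: understand the scene first.
    context = comprehend(scene)

    # (2) Reflection-directed candidate detection: re-propose until the
    # validator raises no issues or the round budget runs out.
    feedback: List[str] = []
    candidates: List[Candidate] = []
    for _ in range(max_rounds):
        candidates = propose(scene, context, feedback)
        issues = validate(scene, candidates)
        if not issues:
            break
        feedback.extend(issues)

    # (3) Context-sensitive interactability classification.
    return [c for c in candidates if classify(context, c)]


# Toy stand-ins so the sketch runs end to end (all assumptions).
def toy_comprehend(scene):
    return scene["summary"]


def toy_propose(scene, context, feedback):
    rejected = set(feedback)
    return [Candidate(label, box) for label, box in scene["objects"]
            if label not in rejected]


def toy_validate(scene, candidates):
    # Flag candidates whose box extends past the frame.
    width, height = scene["size"]
    return [c.label for c in candidates
            if c.box[0] + c.box[2] > width or c.box[1] + c.box[3] > height]


def toy_classify(context, candidate):
    return candidate.label in context["interactable_labels"]


scene = {
    "summary": {"interactable_labels": {"door handle", "menu button"}},
    "size": (100, 100),
    "objects": [
        ("door handle", (10, 10, 20, 20)),
        ("wall poster", (50, 10, 30, 30)),
        ("menu button", (90, 90, 40, 40)),  # box extends past the frame
    ],
}

result = detect_iges(scene, toy_comprehend, toy_propose,
                     toy_validate, toy_classify)
print([c.label for c in result])  # prints ['door handle']
```

In this toy run the first proposal round includes an out-of-frame box, the validator reports it, and the second round drops it. The classifier then keeps only the elements that the scene context marks as interactable, so the same element can be accepted in one context and rejected in another.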