{"title":"Modular AR Framework for Vision-Language Tasks","authors":"Robin Fischer, Tzu-Hsuan Weng, L. Fu","doi":"10.1145/3439133.3439142","DOIUrl":null,"url":null,"abstract":"Mixed / augmented reality systems have become more and more sophisticated in recent years. However, they still lack any ability to reason about the surrounding world. On the other hand, computer vision research has made many advancements towards a more human-like reasoning process. This paper aims to bridge these 2 research areas by implementing a modular framework which interconnects an AR application with a deep learning based vision model. Finally, a few potential use cases of the proposed system are showcased. The developed framework allows the application to utilize a variety of Vision-Language (V+L) models, to gain additional understanding about the surrounding environment. The system is designed to be modular and expandable. It is able to connect any number of Python processes of the V+L models to Unity apps using AR technology. The system was evaluated in our university's smart home lab based on daily life use cases. With a further extension of the framework by additional downstream tasks provided by V+L models and other computer vision systems, this framework should find wider adoption in AR applications. The increasing ability of applications to comprehend visual common sense and natural conversations would enable more intuitive interactions with the user, who could perceive his device more as a virtual assistant and companion.","PeriodicalId":291985,"journal":{"name":"2020 4th International Conference on Artificial Intelligence and Virtual Reality","volume":"71 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 4th International Conference on Artificial Intelligence and Virtual Reality","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3439133.3439142","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Mixed/augmented reality (AR) systems have become increasingly sophisticated in recent years, yet they still lack the ability to reason about the surrounding world. Computer vision research, on the other hand, has made many advances toward a more human-like reasoning process. This paper aims to bridge these two research areas by implementing a modular framework that interconnects an AR application with a deep-learning-based vision model. Finally, several potential use cases of the proposed system are showcased. The developed framework allows the application to utilize a variety of Vision-Language (V+L) models to gain additional understanding of the surrounding environment. The system is designed to be modular and expandable: it can connect any number of Python processes hosting V+L models to Unity-based AR applications. The system was evaluated in our university's smart home lab on daily-life use cases. Extended with additional downstream tasks provided by V+L models and other computer vision systems, the framework should find wider adoption in AR applications. The growing ability of applications to comprehend visual common sense and natural conversation would enable more intuitive interaction with users, who could then perceive their devices more as virtual assistants and companions.
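The abstract describes connecting Python-hosted V+L model processes to Unity AR apps but does not specify the transport or message format. Below is a minimal sketch of what one such model process could look like, assuming a simple length-prefixed TCP protocol; the port number, the protocol, and the stub captioner() are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of one "V+L model process", assuming a length-prefixed
# TCP protocol between the Unity client and Python. captioner() is a stub
# standing in for a real Vision-Language model (e.g. image captioning).
import socket
import struct

HOST, PORT = "0.0.0.0", 9000  # assumed port; not specified in the paper


def captioner(jpeg_bytes: bytes) -> str:
    """Stub for a V+L model; replace with a real captioning/VQA model."""
    return f"received frame of {len(jpeg_bytes)} bytes"


def recv_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the connection."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("client closed connection")
        buf += chunk
    return buf


def serve() -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            with conn:
                # Assumed framing: 4-byte big-endian length, then a JPEG frame.
                (size,) = struct.unpack(">I", recv_exact(conn, 4))
                frame = recv_exact(conn, size)
                reply = captioner(frame).encode("utf-8")
                # Reply uses the same framing: length prefix, then UTF-8 text.
                conn.sendall(struct.pack(">I", len(reply)) + reply)


if __name__ == "__main__":
    serve()
```

Under this assumed design, a Unity client would send each camera frame with the same 4-byte length prefix and read back a framed text reply; additional V+L downstream tasks would run as further Python processes of the same shape, each on its own port.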