{"title":"基于多尺度时空注意力网络的沉浸式虚拟现实中的人类动作识别","authors":"Zhiyong Xiao, Yukun Chen, Xinlei Zhou, Mingwei He, Li Liu, Feng Yu, Minghua Jiang","doi":"10.1002/cav.2293","DOIUrl":null,"url":null,"abstract":"<p>Wearable human action recognition (HAR) has practical applications in daily life. However, traditional HAR methods solely focus on identifying user movements, lacking interactivity and user engagement. This paper proposes a novel immersive HAR method called MovPosVR. Virtual reality (VR) technology is employed to create realistic scenes and enhance the user experience. To improve the accuracy of user action recognition in immersive HAR, a multi-scale spatio-temporal attention network (MSSTANet) is proposed. The network combines the convolutional residual squeeze and excitation (CRSE) module with the multi-branch convolution and long short-term memory (MCLSTM) module to extract spatio-temporal features and automatically select relevant features from action signals. Additionally, a multi-head attention with shared linear mechanism (MHASLM) module is designed to facilitate information interaction, further enhancing feature extraction and improving accuracy. The MSSTANet network achieves superior performance, with accuracy rates of 99.33% and 98.83% on the publicly available WISDM and PAMPA2 datasets, respectively, surpassing state-of-the-art networks. Our method showcases the potential to display user actions and position information in a virtual world, enriching user experiences and interactions across diverse application scenarios.</p>","PeriodicalId":50645,"journal":{"name":"Computer Animation and Virtual Worlds","volume":"35 5","pages":""},"PeriodicalIF":0.9000,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Human action recognition in immersive virtual reality based on multi-scale spatio-temporal attention network\",\"authors\":\"Zhiyong Xiao, Yukun Chen, Xinlei Zhou, Mingwei He, Li Liu, Feng Yu, Minghua Jiang\",\"doi\":\"10.1002/cav.2293\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Wearable human action recognition (HAR) has practical applications in daily life. However, traditional HAR methods solely focus on identifying user movements, lacking interactivity and user engagement. This paper proposes a novel immersive HAR method called MovPosVR. Virtual reality (VR) technology is employed to create realistic scenes and enhance the user experience. To improve the accuracy of user action recognition in immersive HAR, a multi-scale spatio-temporal attention network (MSSTANet) is proposed. The network combines the convolutional residual squeeze and excitation (CRSE) module with the multi-branch convolution and long short-term memory (MCLSTM) module to extract spatio-temporal features and automatically select relevant features from action signals. Additionally, a multi-head attention with shared linear mechanism (MHASLM) module is designed to facilitate information interaction, further enhancing feature extraction and improving accuracy. The MSSTANet network achieves superior performance, with accuracy rates of 99.33% and 98.83% on the publicly available WISDM and PAMPA2 datasets, respectively, surpassing state-of-the-art networks. Our method showcases the potential to display user actions and position information in a virtual world, enriching user experiences and interactions across diverse application scenarios.</p>\",\"PeriodicalId\":50645,\"journal\":{\"name\":\"Computer Animation and Virtual Worlds\",\"volume\":\"35 5\",\"pages\":\"\"},\"PeriodicalIF\":0.9000,\"publicationDate\":\"2024-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Animation and Virtual Worlds\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/cav.2293\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Animation and Virtual Worlds","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/cav.2293","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0
摘要
可穿戴人体动作识别(HAR)在日常生活中有着实际应用。然而,传统的 HAR 方法只关注识别用户动作,缺乏互动性和用户参与性。本文提出了一种名为 MovPosVR 的新型沉浸式 HAR 方法。采用虚拟现实(VR)技术创建逼真的场景,增强用户体验。为了提高沉浸式 HAR 中用户动作识别的准确性,本文提出了一种多尺度时空注意力网络(MSSTANet)。该网络将卷积残差挤压和激励(CRSE)模块与多分支卷积和长短期记忆(MCLSTM)模块相结合,以提取时空特征,并从动作信号中自动选择相关特征。此外,还设计了具有共享线性机制的多头注意力(MHASLM)模块,以促进信息交互,进一步加强特征提取并提高准确性。MSSTANet 网络性能卓越,在公开的 WISDM 和 PAMPA2 数据集上的准确率分别达到 99.33% 和 98.83%,超过了最先进的网络。我们的方法展示了在虚拟世界中显示用户操作和位置信息的潜力,丰富了用户在各种应用场景中的体验和互动。
Human action recognition in immersive virtual reality based on multi-scale spatio-temporal attention network
Wearable human action recognition (HAR) has practical applications in daily life. However, traditional HAR methods solely focus on identifying user movements, lacking interactivity and user engagement. This paper proposes a novel immersive HAR method called MovPosVR. Virtual reality (VR) technology is employed to create realistic scenes and enhance the user experience. To improve the accuracy of user action recognition in immersive HAR, a multi-scale spatio-temporal attention network (MSSTANet) is proposed. The network combines the convolutional residual squeeze and excitation (CRSE) module with the multi-branch convolution and long short-term memory (MCLSTM) module to extract spatio-temporal features and automatically select relevant features from action signals. Additionally, a multi-head attention with shared linear mechanism (MHASLM) module is designed to facilitate information interaction, further enhancing feature extraction and improving accuracy. The MSSTANet network achieves superior performance, with accuracy rates of 99.33% and 98.83% on the publicly available WISDM and PAMPA2 datasets, respectively, surpassing state-of-the-art networks. Our method showcases the potential to display user actions and position information in a virtual world, enriching user experiences and interactions across diverse application scenarios.
期刊介绍:
With the advent of very powerful PCs and high-end graphics cards, there has been an incredible development in Virtual Worlds, real-time computer animation and simulation, games. But at the same time, new and cheaper Virtual Reality devices have appeared allowing an interaction with these real-time Virtual Worlds and even with real worlds through Augmented Reality. Three-dimensional characters, especially Virtual Humans are now of an exceptional quality, which allows to use them in the movie industry. But this is only a beginning, as with the development of Artificial Intelligence and Agent technology, these characters will become more and more autonomous and even intelligent. They will inhabit the Virtual Worlds in a Virtual Life together with animals and plants.