Xu Zhao, Chao Tang, Huosheng Hu, Wenjian Wang, Shuo Qiao, Anyang Tong
{"title":"基于注意机制的多模态特征融合网络人体动作识别","authors":"Xu Zhao , Chao Tang , Huosheng Hu , Wenjian Wang , Shuo Qiao , Anyang Tong","doi":"10.1016/j.jvcir.2025.104459","DOIUrl":null,"url":null,"abstract":"<div><div>Current human action recognition (HAR) methods focus on integrating multiple data modalities, such as skeleton data and RGB data. However, they struggle to exploit motion correlation information in skeleton data and rely on spatial representations from RGB modalities. This paper proposes a novel Attention-based Multimodal Feature Integration Network (AMFI-Net) designed to enhance modal fusion and improve recognition accuracy. First, RGB and skeleton data undergo multi-level preprocessing to obtain differential movement representations, which are then input into a heterogeneous network for separate multimodal feature extraction. Next, an adaptive fusion strategy is employed to enhance the integration of these multimodal features. Finally, the network assesses the confidence level of weighted skeleton information to determine the extent and type of appearance information to be used in the final feature integration. Experiments conducted on the NTU-RGB + D dataset demonstrate that the proposed method is feasible, leading to significant improvements in human action recognition accuracy.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"110 ","pages":"Article 104459"},"PeriodicalIF":2.6000,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Attention mechanism based multimodal feature fusion network for human action recognition\",\"authors\":\"Xu Zhao , Chao Tang , Huosheng Hu , Wenjian Wang , Shuo Qiao , Anyang Tong\",\"doi\":\"10.1016/j.jvcir.2025.104459\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Current human action recognition (HAR) methods focus on integrating multiple data modalities, such as skeleton data and RGB data. However, they struggle to exploit motion correlation information in skeleton data and rely on spatial representations from RGB modalities. This paper proposes a novel Attention-based Multimodal Feature Integration Network (AMFI-Net) designed to enhance modal fusion and improve recognition accuracy. First, RGB and skeleton data undergo multi-level preprocessing to obtain differential movement representations, which are then input into a heterogeneous network for separate multimodal feature extraction. Next, an adaptive fusion strategy is employed to enhance the integration of these multimodal features. Finally, the network assesses the confidence level of weighted skeleton information to determine the extent and type of appearance information to be used in the final feature integration. 
Experiments conducted on the NTU-RGB + D dataset demonstrate that the proposed method is feasible, leading to significant improvements in human action recognition accuracy.</div></div>\",\"PeriodicalId\":54755,\"journal\":{\"name\":\"Journal of Visual Communication and Image Representation\",\"volume\":\"110 \",\"pages\":\"Article 104459\"},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2025-04-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Visual Communication and Image Representation\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1047320325000732\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Visual Communication and Image Representation","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1047320325000732","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Attention mechanism based multimodal feature fusion network for human action recognition
Current human action recognition (HAR) methods focus on integrating multiple data modalities, such as skeleton data and RGB data. However, they struggle to exploit the motion correlation information in skeleton data and rely heavily on spatial representations from the RGB modality. This paper proposes a novel Attention-based Multimodal Feature Integration Network (AMFI-Net) designed to enhance modal fusion and improve recognition accuracy. First, RGB and skeleton data undergo multi-level preprocessing to obtain differential movement representations, which are then fed into a heterogeneous network for separate multimodal feature extraction. Next, an adaptive fusion strategy is employed to enhance the integration of these multimodal features. Finally, the network assesses the confidence level of the weighted skeleton information to determine the extent and type of appearance information to be used in the final feature integration. Experiments conducted on the NTU RGB+D dataset demonstrate that the proposed method is feasible and yields significant improvements in human action recognition accuracy.
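The abstract describes a pipeline of heterogeneous per-modality feature extraction, attention-based adaptive fusion, and confidence-gated use of appearance information. The paper's actual architecture is not given here, so the following is only a minimal PyTorch-style sketch of what such a fusion module could look like: the class name AdaptiveFusion, the feature dimension, the number of classes, and the specific gating rule are illustrative assumptions, not the authors' AMFI-Net implementation.

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    # Hypothetical reading of the abstract's "adaptive fusion strategy":
    # attention weights balance the two modality features, and a scalar
    # confidence on the weighted skeleton feature controls how much RGB
    # (appearance) information enters the final representation.
    def __init__(self, dim: int = 256, num_classes: int = 60):
        super().__init__()
        self.attn = nn.Linear(dim, 1)                    # per-modality attention score
        self.confidence = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, f_skel: torch.Tensor, f_rgb: torch.Tensor) -> torch.Tensor:
        # f_skel, f_rgb: (batch, dim) features from the heterogeneous branches.
        scores = torch.cat([self.attn(f_skel), self.attn(f_rgb)], dim=1)  # (batch, 2)
        w = torch.softmax(scores, dim=1)                                  # attention weights
        weighted_skel = w[:, :1] * f_skel
        conf = self.confidence(weighted_skel)                             # skeleton confidence in [0, 1]
        # Lower skeleton confidence admits more appearance information.
        fused = weighted_skel + (1.0 - conf) * w[:, 1:] * f_rgb
        return self.classifier(fused)

if __name__ == "__main__":
    fusion = AdaptiveFusion()
    logits = fusion(torch.randn(4, 256), torch.randn(4, 256))
    print(logits.shape)  # torch.Size([4, 60])

This sketch only illustrates the general idea of confidence-gated multimodal fusion; the actual weighting scheme, preprocessing, and backbone networks are described in the full paper.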
Journal introduction:
The Journal of Visual Communication and Image Representation publishes papers on state-of-the-art visual communication and image representation, with emphasis on novel technologies and theoretical work in this multidisciplinary area of pure and applied research. The field of visual communication and image representation is considered in its broadest sense and covers both digital and analog aspects as well as processing and communication in biological visual systems.