{"title":"Cmf-转换器:用于人类动作识别的跨模态融合转换器","authors":"Jun Wang, Limin Xia, Xin Wen","doi":"10.1007/s00138-024-01598-0","DOIUrl":null,"url":null,"abstract":"<p>In human action recognition, both spatio-temporal videos and skeleton features alone can achieve good recognition performance, however, how to combine these two modalities to achieve better performance is still a worthy research direction. In order to better combine the two modalities, we propose a novel Cross-Modal Transformer for human action recognition—CMF-Transformer, which effectively fuses two different modalities. In spatio-temporal modality, video frames are used as inputs and directional attention is used in the transformer to obtain the order of recognition between different spatio-temporal blocks. In skeleton joint modality, skeleton joints are used as inputs to explore more complete correlations in different skeleton joints by spatio-temporal cross-attention in the transformer. Subsequently, a multimodal collaborative recognition strategy is used to identify the respective features and connectivity features of two modalities separately, and then weight the identification results separately to synergistically identify target action by fusing the features under the two modalities. 
A series of experiments on three benchmark datasets demonstrate that the performance of CMF-Transformer in this paper outperforms most current state-of-the-art methods.\n</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"1 1","pages":""},"PeriodicalIF":2.4000,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Cmf-transformer: cross-modal fusion transformer for human action recognition\",\"authors\":\"Jun Wang, Limin Xia, Xin Wen\",\"doi\":\"10.1007/s00138-024-01598-0\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>In human action recognition, both spatio-temporal videos and skeleton features alone can achieve good recognition performance, however, how to combine these two modalities to achieve better performance is still a worthy research direction. In order to better combine the two modalities, we propose a novel Cross-Modal Transformer for human action recognition—CMF-Transformer, which effectively fuses two different modalities. In spatio-temporal modality, video frames are used as inputs and directional attention is used in the transformer to obtain the order of recognition between different spatio-temporal blocks. In skeleton joint modality, skeleton joints are used as inputs to explore more complete correlations in different skeleton joints by spatio-temporal cross-attention in the transformer. Subsequently, a multimodal collaborative recognition strategy is used to identify the respective features and connectivity features of two modalities separately, and then weight the identification results separately to synergistically identify target action by fusing the features under the two modalities. 
A series of experiments on three benchmark datasets demonstrate that the performance of CMF-Transformer in this paper outperforms most current state-of-the-art methods.\\n</p>\",\"PeriodicalId\":51116,\"journal\":{\"name\":\"Machine Vision and Applications\",\"volume\":\"1 1\",\"pages\":\"\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2024-08-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Machine Vision and Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s00138-024-01598-0\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Vision and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00138-024-01598-0","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
CMF-Transformer: Cross-Modal Fusion Transformer for Human Action Recognition
In human action recognition, spatio-temporal video features and skeleton features can each achieve good recognition performance on their own; how best to combine the two modalities, however, remains a worthwhile research direction. To combine them more effectively, we propose a novel cross-modal transformer for human action recognition, CMF-Transformer, which fuses the two modalities. In the spatio-temporal modality, video frames serve as inputs, and directional attention in the transformer captures the recognition order among different spatio-temporal blocks. In the skeleton modality, skeleton joints serve as inputs, and spatio-temporal cross-attention in the transformer uncovers more complete correlations among the joints. A multimodal collaborative recognition strategy then identifies the individual features and the connectivity features of the two modalities separately, weights the resulting predictions, and fuses the features of both modalities to recognize the target action synergistically. Experiments on three benchmark datasets demonstrate that CMF-Transformer outperforms most current state-of-the-art methods.
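The weighted collaborative-recognition step described in the abstract amounts to a late fusion of per-modality class scores. A minimal sketch of that idea, assuming illustrative fusion weights and class counts (the paper's actual weighting scheme and fusion details are not given here):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_predictions(video_logits, skeleton_logits, w_video=0.6, w_skeleton=0.4):
    """Weight each modality's class distribution, then combine them.

    The weights here are hypothetical; in practice they would be tuned
    or learned as part of the collaborative recognition strategy.
    """
    p_video = softmax(video_logits)
    p_skeleton = softmax(skeleton_logits)
    return w_video * p_video + w_skeleton * p_skeleton

# Toy example with 3 action classes; the two modalities disagree slightly.
video_logits = np.array([2.0, 1.0, 0.1])     # video branch favours class 0
skeleton_logits = np.array([1.5, 2.5, 0.2])  # skeleton branch favours class 1
fused = fuse_predictions(video_logits, skeleton_logits)
pred = int(np.argmax(fused))  # fused decision over both modalities
```

With these toy numbers the stronger video weight tips the fused decision toward the video branch's class, which is the point of weighting the modalities rather than averaging them uniformly.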
Journal introduction:
Machine Vision and Applications publishes high-quality technical contributions in machine vision research and development. Specifically, the editors encourage submissions in all applications and engineering aspects of image-related computing. In particular, original contributions dealing with scientific, commercial, industrial, military, and biomedical applications of machine vision are all within the scope of the journal.
Particular emphasis is placed on engineering and technology aspects of image processing and computer vision.
The following aspects of machine vision applications are of interest: algorithms, architectures, VLSI implementations, AI techniques and expert systems for machine vision, front-end sensing, multidimensional and multisensor machine vision, real-time techniques, image databases, virtual reality and visualization. Papers must include a significant experimental validation component.