{"title":"Transformer-based multiview spatiotemporal feature interactive fusion for human action recognition in depth videos","authors":"Hanbo Wu, Xin Ma, Yibin Li","doi":"10.1016/j.image.2024.117244","DOIUrl":null,"url":null,"abstract":"<div><div>Spatiotemporal feature modeling is the key to human action recognition task. Multiview data is helpful in acquiring numerous clues to improve the robustness and accuracy of feature description. However, multiview action recognition has not been well explored yet. Most existing methods perform action recognition only from a single view, which leads to the limited performance. Depth data is insensitive to illumination and color variations and offers significant advantages by providing reliable 3D geometric information of the human body. In this study, we concentrate on action recognition from depth videos and introduce a transformer-based framework for the interactive fusion of multiview spatiotemporal features, facilitating effective action recognition through deep integration of multiview information. Specifically, the proposed framework consists of intra-view spatiotemporal feature modeling (ISTFM) and cross-view feature interactive fusion (CFIF). Firstly, we project a depth video into three orthogonal views to construct multiview depth dynamic volumes that describe the 3D spatiotemporal evolution of human actions. ISTFM takes multiview depth dynamic volumes as input to extract spatiotemporal features of three views with 3D CNN, then applies self-attention mechanism in transformer to model global context dependency within each view. CFIF subsequently extends self-attention into cross-attention to conduct deep interaction between different views, and further integrates cross-view features together to generate a multiview joint feature representation. Our proposed method is tested on two large-scale RGBD datasets by extensive experiments to demonstrate the remarkable improvement for enhancing the recognition performance.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"131 ","pages":"Article 117244"},"PeriodicalIF":3.4000,"publicationDate":"2024-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Signal Processing-Image Communication","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0923596524001450","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Spatiotemporal feature modeling is the key to the human action recognition task. Multiview data provides numerous complementary cues that improve the robustness and accuracy of feature description. However, multiview action recognition has not yet been well explored: most existing methods recognize actions from only a single view, which limits their performance. Depth data is insensitive to illumination and color variations and offers significant advantages by providing reliable 3D geometric information about the human body. In this study, we concentrate on action recognition from depth videos and introduce a transformer-based framework for the interactive fusion of multiview spatiotemporal features, enabling effective action recognition through deep integration of multiview information. Specifically, the proposed framework consists of intra-view spatiotemporal feature modeling (ISTFM) and cross-view feature interactive fusion (CFIF). First, we project a depth video onto three orthogonal views to construct multiview depth dynamic volumes that describe the 3D spatiotemporal evolution of human actions. ISTFM takes these multiview depth dynamic volumes as input, extracts spatiotemporal features from the three views with a 3D CNN, and then applies the transformer's self-attention mechanism to model global context dependencies within each view. CFIF subsequently extends self-attention to cross-attention to conduct deep interaction between different views, and further integrates the cross-view features into a joint multiview feature representation. Extensive experiments on two large-scale RGB-D datasets demonstrate that the proposed method yields a remarkable improvement in recognition performance.
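To make the two-stage design concrete, below is a minimal PyTorch sketch of intra-view self-attention followed by cross-view cross-attention, as the abstract describes. All names, dimensions, and the pairwise fusion scheme here are illustrative assumptions, not the authors' released implementation; the 3D-CNN backbone is omitted and each view is assumed to be already encoded as a token sequence.

```python
# Hedged sketch: ISTFM-style self-attention per view, then CFIF-style
# cross-attention between views. Module names and shapes are assumptions.
import torch
import torch.nn as nn


class IntraViewSelfAttention(nn.Module):
    """Models global context within one view's spatiotemporal tokens (ISTFM-like)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), e.g. flattened 3D-CNN features of one view.
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)  # residual + norm, standard transformer practice


class CrossViewFusion(nn.Module):
    """Extends self-attention to cross-attention between two views (CFIF-like)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_view: torch.Tensor, other_view: torch.Tensor) -> torch.Tensor:
        # Queries come from one view; keys/values from another, so each view
        # attends to complementary evidence in its counterpart.
        out, _ = self.attn(query_view, other_view, other_view)
        return self.norm(query_view + out)


if __name__ == "__main__":
    # Three orthogonal projections (e.g. front/side/top) of a depth video,
    # each encoded as a token sequence of shape (batch, tokens, dim).
    batch, tokens, dim = 2, 49, 256
    views = [torch.randn(batch, tokens, dim) for _ in range(3)]

    intra = IntraViewSelfAttention(dim)
    cross = CrossViewFusion(dim)

    views = [intra(v) for v in views]  # per-view global context
    fused = [cross(views[i], views[(i + 1) % 3]) for i in range(3)]  # pairwise interaction
    joint = torch.cat(fused, dim=-1)  # joint multiview representation
    print(joint.shape)  # torch.Size([2, 49, 768])
```

Concatenation is just one plausible way to integrate the cross-view features into a joint representation; the paper itself only states that cross-view features are "integrated together", so a learned fusion layer could equally be substituted.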
Journal Introduction:
Signal Processing: Image Communication is an international journal for the development of the theory and practice of image communication. Its primary objectives are the following:
To present a forum for the advancement of theory and practice of image communication.
To stimulate cross-fertilization between areas that are similar in nature but have traditionally been separated, for example, various aspects of visual communications and information systems.
To contribute to a rapid information exchange between the industrial and academic environments.
The editorial policy and the technical content of the journal are the responsibility of the Editor-in-Chief, the Area Editors and the Advisory Editors. The Journal is self-supporting from subscription income and contains a minimum amount of advertisements. Advertisements are subject to the prior approval of the Editor-in-Chief. The journal welcomes contributions from every country in the world.
Signal Processing: Image Communication publishes articles relating to aspects of the design, implementation and use of image communication systems. The journal features original research work, tutorial and review articles, and accounts of practical developments.
Subjects of interest include image/video coding, 3D video representations and compression, 3D graphics and animation compression, HDTV and 3DTV systems, video adaptation, video over IP, peer-to-peer video networking, interactive visual communication, multi-user video conferencing, wireless video broadcasting and communication, visual surveillance, 2D and 3D image/video quality measures, pre/post processing, video restoration and super-resolution, multi-camera video analysis, motion analysis, content-based image/video indexing and retrieval, face and gesture processing, video synthesis, 2D and 3D image/video acquisition and display technologies, architectures for image/video processing and communication.