Local and global self-attention enhanced graph convolutional network for skeleton-based action recognition

IF 7.5, Zone 1 (Computer Science), JCR Q1 (Computer Science, Artificial Intelligence)

Pattern Recognition, Pub Date: 2024-10-31, DOI: 10.1016/j.patcog.2024.111106

Zhize Wu, Yue Ding, Long Wan, Teng Li, Fudong Nian

Pattern Recognition, Volume 159, Article 111106. Full text: https://www.sciencedirect.com/science/article/pii/S0031320324008574

Citations: 0

Abstract



The prevailing paradigm for skeleton-based action recognition combines Graph Convolutional Networks (GCNs), which model spatial correlations, with Temporal Convolutional Networks (TCNs), which extract motion features. Such GCN-TCN approaches usually rely on local graph convolution operations, which limits their ability to capture complex correlations among distant joints and to represent long-range dependencies. Although self-attention, which originated in Transformers, shows great potential for modeling correlations among all joints globally, Transformer-based methods are usually computationally expensive and ignore the physical connectivity of the human skeleton. To address these issues, we propose a novel Local-Global Self-Attention Enhanced Graph Convolutional Network (LG-SGNet) that simultaneously learns local and global representations in the spatial-temporal dimension. Our approach consists of three components. The Local-Global Graph Convolutional Network (LG-GCN) module extracts local and global spatial feature representations through parallel channel-specific global and local spatial modeling. The Local-Global Temporal Convolutional Network (LG-TCN) module performs joint-wise global temporal modeling using multi-head self-attention in parallel with local temporal modeling. This constitutes a new multi-branch temporal convolution structure that effectively captures both long-range dependencies and subtle temporal structures. Finally, the Dynamic Frame Weighting Module (DFWM) adjusts the weights of the frames in a skeleton action sequence, allowing the model to adaptively focus on the features of representative frames for more efficient action recognition. Extensive experiments demonstrate that LG-SGNet performs very competitively with state-of-the-art methods. The project website is available at https://github.com/DingYyue/LG-SGNet.
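The temporal ideas in the abstract (joint-wise multi-head self-attention running in parallel with a local temporal convolution, followed by per-frame weighting) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which is available at the project repository. The function names, the (frames, joints, channels) tensor layout, the fusion of the two branches by summation, and the softmax-based frame scoring are all assumptions made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_temporal_attention(x, wq, wk, wv, num_heads):
    """Joint-wise multi-head self-attention over frames (hypothetical helper).

    x: (T, V, C) array of T frames, V joints, C channels.
    Each joint attends only over its own T frames.
    """
    T, V, C = x.shape
    d = C // num_heads

    def split(w):
        # Project, then reshape (T, V, C) -> (V, heads, T, d).
        return (x @ w).reshape(T, V, num_heads, d).transpose(1, 2, 0, 3)

    q, k, v = split(wq), split(wk), split(wv)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(d), axis=-1)  # (V, H, T, T)
    out = attn @ v                                                     # (V, H, T, d)
    return out.transpose(2, 0, 1, 3).reshape(T, V, C)

def local_temporal_conv(x, kernel):
    """Depthwise temporal convolution with zero 'same' padding (hypothetical helper).

    x: (T, V, C); kernel: (K, C) with K odd.
    """
    K = kernel.shape[0]
    pad = K // 2
    T = x.shape[0]
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0)))
    return sum(xp[i:i + T] * kernel[i] for i in range(K))

def frame_weights(x, w):
    """Score each frame from its joint-averaged features; weights sum to 1 (assumed scheme)."""
    scores = x.mean(axis=1) @ w  # (T,)
    return softmax(scores, axis=0)

def sketch_forward(x, params, num_heads=2):
    # Global and local temporal branches run in parallel on the same input.
    g = global_temporal_attention(x, params["wq"], params["wk"], params["wv"], num_heads)
    l = local_temporal_conv(x, params["kernel"])
    y = g + l  # assumption: branches are fused by summation
    a = frame_weights(y, params["w_frame"])
    # Re-weight frames so that higher-scoring frames contribute more.
    return y * a[:, None, None], a

rng = np.random.default_rng(0)
T, V, C = 16, 25, 8  # arbitrary sizes chosen for the example
x = rng.standard_normal((T, V, C))
params = {
    "wq": rng.standard_normal((C, C)) * 0.1,
    "wk": rng.standard_normal((C, C)) * 0.1,
    "wv": rng.standard_normal((C, C)) * 0.1,
    "kernel": rng.standard_normal((3, C)) * 0.1,
    "w_frame": rng.standard_normal(C),
}
y, a = sketch_forward(x, params)
print(y.shape, a.shape)  # (16, 25, 8) (16,)
```

In this sketch the attention branch sees every frame of a joint at once, while the convolution branch sees only a three-frame window, which is one simple way to read the abstract's contrast between long-range dependencies and subtle local temporal structure.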


Source journal: Pattern Recognition (category: Engineering and Technology, Engineering: Electrical and Electronic)

CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.