Liping Nong , Zhuocheng Huang , Junyi Wang , Yanpeng Rong , Jie Peng , Yiping Huang
{"title":"Combine multi-order representation learning and frame optimization learning for skeleton-based action recognition","authors":"Liping Nong , Zhuocheng Huang , Junyi Wang , Yanpeng Rong , Jie Peng , Yiping Huang","doi":"10.1016/j.dsp.2024.104823","DOIUrl":null,"url":null,"abstract":"<div><div>Skeleton-based action recognition has broad application prospects in many fields such as virtual reality. Currently, the most popular way is to employ Graph Convolutional Networks (GCNs) or Hypergraph Convolutional Networks (HGCNs) for this task. However, GCN-based methods may heavily rely on the physical connectivity relationship between joints while lack the capture of higher-order information about interactions among distant joints, and HGCN-based methods usually introduce unnecessary noise when capturing low-order information of skeleton structures with simple topology. Besides, the current methods do not deal well with redundant frames and confusing frames. These limitations hinder the improvement of recognition accuracy. In this paper, we propose a novel network, called Hyper-Net, which combines multi-order representation learning and frame optimization learning for skeleton-based action recognition. Specifically, the proposed Hyper-Net contains Temporal-Channel Aggregation Graph Convolution (TCA-GC), Spatial-Temporal Aggregation Hypergraph Convolution (STA-HC) and Frame Optimization Learning (F-OL) modules. The TCA-GC aggregates low-order and local information from simple joint and bone topologies across different temporal and channel dimensions. The STA-HC captures high-order and global information from complex motion streams as well as solving the problem of spatial-temporal weight imbalance. The F-OL can adaptively extract key frames and distinguish confusing frames, thus improving the ability of the network to recognize confusing actions. A large number of experiments are conducted on the NTU RGB+D, NTU RGB+D 120 and NW-UCLA datasets for action recognition task. Experimental results demonstrate the superiority and effectiveness of the proposed network.</div></div>","PeriodicalId":51011,"journal":{"name":"Digital Signal Processing","volume":"156 ","pages":"Article 104823"},"PeriodicalIF":2.9000,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1051200424004482","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0
Abstract
Skeleton-based action recognition has broad application prospects in many fields such as virtual reality. Currently, the most popular way is to employ Graph Convolutional Networks (GCNs) or Hypergraph Convolutional Networks (HGCNs) for this task. However, GCN-based methods may heavily rely on the physical connectivity relationship between joints while lack the capture of higher-order information about interactions among distant joints, and HGCN-based methods usually introduce unnecessary noise when capturing low-order information of skeleton structures with simple topology. Besides, the current methods do not deal well with redundant frames and confusing frames. These limitations hinder the improvement of recognition accuracy. In this paper, we propose a novel network, called Hyper-Net, which combines multi-order representation learning and frame optimization learning for skeleton-based action recognition. Specifically, the proposed Hyper-Net contains Temporal-Channel Aggregation Graph Convolution (TCA-GC), Spatial-Temporal Aggregation Hypergraph Convolution (STA-HC) and Frame Optimization Learning (F-OL) modules. The TCA-GC aggregates low-order and local information from simple joint and bone topologies across different temporal and channel dimensions. The STA-HC captures high-order and global information from complex motion streams as well as solving the problem of spatial-temporal weight imbalance. The F-OL can adaptively extract key frames and distinguish confusing frames, thus improving the ability of the network to recognize confusing actions. A large number of experiments are conducted on the NTU RGB+D, NTU RGB+D 120 and NW-UCLA datasets for action recognition task. Experimental results demonstrate the superiority and effectiveness of the proposed network.
期刊介绍:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing yet it aims to be the most innovative. The Journal invites top quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data• machine learning• internet of things• information security• systems biology and computational biology,• financial time series analysis,• autonomous vehicles,• quantum computing,• neuromorphic engineering,• human-computer interaction and intelligent user interfaces,• environmental signal processing,• geophysical signal processing including seismic signal processing,• chemioinformatics and bioinformatics,• audio, visual and performance arts,• disaster management and prevention,• renewable energy,