Runjie Li, Ning He, Jinhua Wang, Fengxi Sun, Hongfei Liu
{"title":"GLST-Net:用于骨骼动作识别的全局和局部时空特征融合网络","authors":"Runjie Li , Ning He , Jinhua Wang , Fengxi Sun , Hongfei Liu","doi":"10.1016/j.jvcir.2025.104515","DOIUrl":null,"url":null,"abstract":"<div><div>The Graph Convolutional Network (GCN) has been widely applied in skeleton-based action recognition. However, GCNs typically operate based on local graph structures, which limits their ability to recognize and process complex long-range relationships between joints. The proposed GLST-Net consists of three main modules: the Global–Local Dual-Stream Feature Extraction (GLDE) module, the Multi-Scale Temporal Difference Modeling (MTDM) module, and the Temporal Feature Extraction (TFE) module. The GLDE is designed to capture both global and local feature information throughout the motion process and dynamically combine these two types of features. Additionally, since motion is defined by the changes observed between consecutive frames, MTDM extracts inter-frame difference information by calculating differences across multiple time scales, thereby enhancing the model’s temporal modeling capability. Finally, TFE effectively strengthens the model’s ability to extract temporal features. Extensive experiments conducted on the challenging NTU-RGB+D and UAV-Human datasets demonstrate the effectiveness and superiority of the proposed method.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104515"},"PeriodicalIF":3.1000,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"GLST-Net: Global and local spatio-temporal feature fusion network for skeleton-based action recognition\",\"authors\":\"Runjie Li , Ning He , Jinhua Wang , Fengxi Sun , Hongfei Liu\",\"doi\":\"10.1016/j.jvcir.2025.104515\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The Graph Convolutional Network (GCN) has been widely applied in skeleton-based action recognition. However, GCNs typically operate based on local graph structures, which limits their ability to recognize and process complex long-range relationships between joints. The proposed GLST-Net consists of three main modules: the Global–Local Dual-Stream Feature Extraction (GLDE) module, the Multi-Scale Temporal Difference Modeling (MTDM) module, and the Temporal Feature Extraction (TFE) module. The GLDE is designed to capture both global and local feature information throughout the motion process and dynamically combine these two types of features. Additionally, since motion is defined by the changes observed between consecutive frames, MTDM extracts inter-frame difference information by calculating differences across multiple time scales, thereby enhancing the model’s temporal modeling capability. Finally, TFE effectively strengthens the model’s ability to extract temporal features. 
Extensive experiments conducted on the challenging NTU-RGB+D and UAV-Human datasets demonstrate the effectiveness and superiority of the proposed method.</div></div>\",\"PeriodicalId\":54755,\"journal\":{\"name\":\"Journal of Visual Communication and Image Representation\",\"volume\":\"111 \",\"pages\":\"Article 104515\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2025-06-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Visual Communication and Image Representation\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1047320325001294\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Visual Communication and Image Representation","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1047320325001294","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
GLST-Net: Global and local spatio-temporal feature fusion network for skeleton-based action recognition
The Graph Convolutional Network (GCN) has been widely applied in skeleton-based action recognition. However, GCNs typically operate on local graph structures, which limits their ability to capture complex long-range relationships between joints. To address this limitation, the proposed GLST-Net comprises three main modules: the Global–Local Dual-Stream Feature Extraction (GLDE) module, the Multi-Scale Temporal Difference Modeling (MTDM) module, and the Temporal Feature Extraction (TFE) module. GLDE captures both global and local feature information throughout the motion process and dynamically combines the two types of features. In addition, since motion is defined by the changes observed between consecutive frames, MTDM extracts inter-frame difference information by computing differences across multiple time scales, thereby enhancing the model's temporal modeling capability. Finally, TFE further strengthens the model's ability to extract temporal features. Extensive experiments on the challenging NTU-RGB+D and UAV-Human datasets demonstrate the effectiveness and superiority of the proposed method.
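To make the multi-scale temporal differencing idea behind MTDM concrete, the following is a minimal PyTorch sketch: features at frame t are compared with features at frame t − d for several temporal offsets d, and the per-scale differences are fused. The class name, offset set, and fusion layer are illustrative assumptions and not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiScaleTemporalDifference(nn.Module):
    """Sketch of multi-scale inter-frame differencing on skeleton features.

    Hypothetical module: for each temporal offset d, subtract the feature map
    shifted by d frames, then fuse the per-scale differences with a 1x1 conv.
    """

    def __init__(self, channels: int, offsets=(1, 2, 4)):
        super().__init__()
        self.offsets = offsets
        # Fuse the concatenated per-scale difference maps back to `channels`.
        self.fuse = nn.Conv2d(channels * len(offsets), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, V) -- batch, channels, frames, joints.
        diffs = []
        for d in self.offsets:
            # Zero-padded shift by d frames along the temporal axis,
            # then subtract to obtain inter-frame motion at that scale.
            shifted = torch.zeros_like(x)
            shifted[:, :, d:, :] = x[:, :, :-d, :]
            diffs.append(x - shifted)
        return self.fuse(torch.cat(diffs, dim=1))


if __name__ == "__main__":
    # Toy skeleton tensor: batch of 2, 64 channels, 32 frames, 25 joints.
    feats = torch.randn(2, 64, 32, 25)
    mtdm = MultiScaleTemporalDifference(channels=64)
    print(mtdm(feats).shape)  # torch.Size([2, 64, 32, 25])
```

Under these assumptions, larger offsets expose slower motion patterns while small offsets capture frame-to-frame changes, which is one plausible way to realize differencing "across multiple time scales" as described in the abstract.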
Journal overview:
The Journal of Visual Communication and Image Representation publishes papers on state-of-the-art visual communication and image representation, with emphasis on novel technologies and theoretical work in this multidisciplinary area of pure and applied research. The field of visual communication and image representation is considered in its broadest sense and covers both digital and analog aspects as well as processing and communication in biological visual systems.