Large Language Model-Based Spatio-Temporal Semantic Enhancement for Skeleton Action Understanding

IF 1.3; Zone 4 (Computer Science); Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IET Computer Vision Pub Date : 2025-09-14 DOI:10.1049/cvi2.70041

Ran Wei, Hui Jie Zhang, Chang Cao, Fang Zhang, Jun Ling Gao, Xiao Tian Li, Lei Geng



Abstract

Skeleton-based temporal action segmentation aims to segment and classify human actions in untrimmed skeletal sequences. Existing methods struggle to distinguish transition poses between adjacent frames and fail to adequately capture semantic dependencies between joints and actions. To address these challenges, we propose a large language model-based spatio-temporal semantic enhancement (LLM-STSE) method, a novel framework that combines adaptive spatio-temporal axial attention (ASTA-Attention) and dynamic semantic-guided multimodal action segmentation (DSG-MAS). ASTA-Attention models spatial and temporal dependencies using axial attention, whereas DSG-MAS dynamically generates semantic prompts based on joint motion and fuses them with skeleton features for more accurate segmentation. Experiments on the MCFS and PKU-MMD datasets show that LLM-STSE achieves state-of-the-art performance with substantial F1 score gains, improving action segmentation especially in complex transitions.
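As a rough illustration of what "axial attention" means for skeleton data, the sketch below applies plain scaled dot-product self-attention first along the joint axis (within each frame) and then along the time axis (per joint). It is not the authors' ASTA-Attention: the adaptive component is not modelled, the query, key and value projections are assumed to be the identity, and the names `self_attention` and `axial_attention` and the toy input are hypothetical.

```python
import math


def self_attention(seq):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Assumption: queries, keys and values are the inputs themselves
    (identity projections), purely to keep the sketch short.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        # Numerically stable softmax.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, seq)) for c in range(d)])
    return out


def axial_attention(x):
    """x[t][j] is the feature vector of joint j at frame t."""
    T, J = len(x), len(x[0])
    # Spatial axis: within each frame, joints attend to joints.
    spatial = [self_attention(frame) for frame in x]
    # Temporal axis: for each joint, frames attend to frames.
    out = [[None] * J for _ in range(T)]
    for j in range(J):
        track = self_attention([spatial[t][j] for t in range(T)])
        for t in range(T):
            out[t][j] = track[t]
    return out


# Toy input (hypothetical): 4 frames, 3 joints, 2 channels per joint.
x = [[[float(t + j), 1.0] for j in range(3)] for t in range(4)]
y = axial_attention(x)
```

Factorising attention this way means each position attends over T + J other positions rather than T × J, which is the usual motivation for axial attention on long, untrimmed sequences.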


Source journal: IET Computer Vision (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 3.30

Self-citation rate: 11.80%

Review time: 3.4 months

About the journal: IET Computer Vision seeks original research papers in a wide range of areas of computer vision. The vision of the journal is to publish the highest quality research work that is relevant and topical to the field, without forgetting works that aim to introduce new horizons and set the agenda for future avenues of research in computer vision.

IET Computer Vision welcomes submissions on the following topics: biologically and perceptually motivated approaches to low level vision (feature detection, etc.); perceptual grouping and organisation; representation, analysis and matching of 2D and 3D shape; shape-from-X; object recognition; image understanding; learning with visual inputs; motion analysis and object tracking; multiview scene analysis; cognitive approaches in low, mid and high level vision; control in visual systems; colour, reflectance and light; statistical and probabilistic models; face and gesture; surveillance; biometrics and security; robotics; vehicle guidance; automatic model acquisition; medical image analysis and understanding; aerial scene analysis and remote sensing; deep learning models in computer vision.

Both methodological and applications orientated papers are welcome. Manuscripts submitted are expected to include a detailed and analytical review of the literature, a state-of-the-art exposition of the original proposed research and its methodology, a thorough experimental evaluation, and, last but not least, a comparative evaluation against relevant and state-of-the-art methods. Submissions not abiding by these minimum requirements may be returned to authors without being sent to review.

Special Issues, Current Call for Papers:

Computer Vision for Smart Cameras and Camera Networks: https://digital-library.theiet.org/files/IET_CVI_SC.pdf

Computer Vision for the Creative Industries: https://digital-library.theiet.org/files/IET_CVI_CVCI.pdf