MDT: A multiscale differencing transformer with sequence feature relationship mining for robust action recognition

IF 3.5 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Intelligence Pub Date : 2025-08-30 DOI:10.1007/s10489-025-06861-z

Zengzhao Chen, Fumei Ma, Hai Liu, Wenkai Huang, Tingting Liu

Applied Intelligence, volume 55, issue 13. Article page: https://link.springer.com/article/10.1007/s10489-025-06861-z

Citations: 0

Abstract

Skeleton-based action recognition, which analyzes joint coordinates and bone connections to classify human actions, is important in understanding and analyzing human dynamic behaviors. However, actions in complex scenes have a high degree of similarity and variability, with the dynamic changes in human skeletons and subtle temporal variations in particular posing significant challenges to the accuracy and robustness of action recognition systems. To mitigate these challenges, we propose a novel multiscale differencing transformer (MDT) with sequence feature relationship mining for robust action recognition. MDT effectively mines inter-frame timing information and feature distribution differences across multiple scales, enabling a deeper understanding of the nuances between actions. Specifically, we first propose multiscale differential self-attention to handle the need for understanding action changes across multiple time scales, improving the capacity of the model to effectively capture the global and local dynamic features of actions. Then, we introduce a sequence feature relationship mining module to address complex data patterns in scenes that may span multiple sequences, exhibiting both similar and distinct characteristics. By utilizing coarse- and fine-grained sequence information, this module empowers the model to recognize intricate data patterns. On the NTU RGB+D 60 dataset, the proposed MDT model outperforms the recent STAR-Transformer by 1.6% on the Cross-Subject (CS) setting and 1.1% on the Cross-View (CV) setting, demonstrating its consistent effectiveness across different evaluation protocols.
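The abstract does not give formulas for the multiscale differencing it describes, so the following is only a rough illustration of what inter-frame differences at several time scales can look like on skeleton data. It is not the authors' MDT or their differential self-attention. The function names, the `Frame` type alias, and the strides `(1, 2, 4)` are all assumptions chosen for this sketch.

```python
from typing import Dict, List, Sequence

# Hypothetical layout: one frame is a list of joints, each joint a list of coordinates.
Frame = List[List[float]]


def temporal_differences(seq: Sequence[Frame], stride: int) -> List[Frame]:
    """Subtract from each frame the frame `stride` steps earlier, joint by joint."""
    if stride < 1 or stride >= len(seq):
        raise ValueError("stride must be in [1, len(seq) - 1]")
    return [
        [
            [c_now - c_prev for c_now, c_prev in zip(j_now, j_prev)]
            for j_now, j_prev in zip(seq[t], seq[t - stride])
        ]
        for t in range(stride, len(seq))
    ]


def multiscale_differences(
    seq: Sequence[Frame], strides: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[Frame]]:
    """Collect frame differences at several strides (assumed scales, not from the paper)."""
    return {s: temporal_differences(seq, s) for s in strides}


# Toy sequence: 5 frames, 2 joints, 3-D coordinates.
# Joint 0 moves 1.0 along x per frame; joint 1 stays still.
toy = [[[float(t), 0.0, 0.0], [1.0, 1.0, 1.0]] for t in range(5)]
diffs = multiscale_differences(toy, strides=(1, 2, 4))
```

In this toy case the short stride captures the small per-frame motion of joint 0 and the longer strides capture its accumulated displacement, while the stationary joint yields zeros at every scale. A model along the lines the abstract describes would presumably feed such multi-scale signals into attention rather than use them directly.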


Source journal

Applied Intelligence (Engineering and Technology: Computer Science, Artificial Intelligence)

CiteScore: 6.60

Self-citation rate: 20.80%

Articles published: 1361

Review time: 5.9 months

Journal description: With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance. The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.