Multiple Distilling-based spatial-temporal attention networks for unsupervised human action recognition

IF 0.9 4区 计算机科学 Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Cheng Zhang, Jianqi Zhong, Wenming Cao, Jianhua Ji
{"title":"Multiple Distilling-based spatial-temporal attention networks for unsupervised human action recognition","authors":"Cheng Zhang, Jianqi Zhong, Wenming Cao, Jianhua Ji","doi":"10.3233/ida-230399","DOIUrl":null,"url":null,"abstract":"Unsupervised action recognition based on spatiotemporal fusion feature extraction has attracted much attention in recent years. However, existing methods still have several limitations: (1) The long-term dependence relationship is not effectively extracted at the time level. (2) The high-order motion relationship between non-adjacent nodes is not effectively captured at the spatial level. (3) The model complexity is too high when the cascade layer input sequence is long, or there are many key points. To solve these problems, a Multiple Distilling-based spatial-temporal attention (MD-STA) networks is proposed in this paper. This model can extract temporal and spatial features respectively and fuse them. Specifically, we first propose a Screening Self-attention (SSA) module; this module can find long-term dependencies in distant frames and high-order motion patterns between non-adjacent nodes in a single frame through a sparse metric on dot product pairs. Then, we propose the Frames and Keypoint-Distilling (FKD) module, which uses extraction operations to halve the input of the cascade layer to eliminate invalid key points and time frame features, thus reducing time and memory complexity. Finally, the Dim-reduction Fusion (DRF) module is proposed to reduce the dimension of existing features to further eliminate redundancy. Numerous experiments were conducted on three distinct datasets: NTU-60, NTU-120, and UWA3D, showing that MD-STA achieves state-of-the-art standards in skeleton-based unsupervised action recognition.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"131 1","pages":"0"},"PeriodicalIF":0.9000,"publicationDate":"2023-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligent Data Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3233/ida-230399","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Unsupervised action recognition based on spatiotemporal fusion feature extraction has attracted much attention in recent years. However, existing methods still have several limitations: (1) The long-term dependence relationship is not effectively extracted at the time level. (2) The high-order motion relationship between non-adjacent nodes is not effectively captured at the spatial level. (3) The model complexity is too high when the cascade layer input sequence is long, or there are many key points. To solve these problems, a Multiple Distilling-based spatial-temporal attention (MD-STA) networks is proposed in this paper. This model can extract temporal and spatial features respectively and fuse them. Specifically, we first propose a Screening Self-attention (SSA) module; this module can find long-term dependencies in distant frames and high-order motion patterns between non-adjacent nodes in a single frame through a sparse metric on dot product pairs. Then, we propose the Frames and Keypoint-Distilling (FKD) module, which uses extraction operations to halve the input of the cascade layer to eliminate invalid key points and time frame features, thus reducing time and memory complexity. Finally, the Dim-reduction Fusion (DRF) module is proposed to reduce the dimension of existing features to further eliminate redundancy. Numerous experiments were conducted on three distinct datasets: NTU-60, NTU-120, and UWA3D, showing that MD-STA achieves state-of-the-art standards in skeleton-based unsupervised action recognition.
基于多提取的时空注意网络的无监督人类行为识别
基于时空融合特征提取的无监督动作识别近年来备受关注。然而,现有的方法仍然存在一些局限性:(1)不能在时间层面上有效地提取长期依赖关系。(2)非相邻节点之间的高阶运动关系在空间层面上没有得到有效捕获。(3)当级联层输入序列较长或关键点较多时,模型复杂度过高。为了解决这些问题,本文提出了一种基于多重提取的时空注意网络。该模型可以分别提取时空特征并进行融合。具体而言,我们首先提出了筛选自我注意(SSA)模块;该模块可以通过点积对上的稀疏度量来发现远距离帧之间的长期依赖关系和单帧内非相邻节点之间的高阶运动模式。然后,我们提出了帧和关键点提取(FKD)模块,该模块使用提取操作将级联层的输入减半,以消除无效的关键点和时间帧特征,从而降低时间和内存复杂度。最后,提出了Dim-reduction Fusion (DRF)模块,对已有特征进行降维,进一步消除冗余。在三个不同的数据集上进行了大量的实验:NTU-60、NTU-120和UWA3D,表明MD-STA在基于骨架的无监督动作识别方面达到了最先进的标准。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Intelligent Data Analysis
Intelligent Data Analysis 工程技术-计算机:人工智能
CiteScore
2.20
自引率
5.90%
发文量
85
审稿时长
3.3 months
期刊介绍: Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing. In particular, papers are preferred that discuss development of new AI related data analysis architectures, methodologies, and techniques and their applications to various domains.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信