Mining the Salient Spatio-Temporal Feature with S2TF-Net for action recognition
Xiaoxi Liu, Ju Liu, Lingchen Gu, Yafeng Li, Xiaojun Chang, Feiping Nie
Signal Processing: Image Communication, Volume 138, Article 117381 (published 2025-07-15). DOI: 10.1016/j.image.2025.117381
Citations: 0
Abstract
Recently, 3D Convolutional Neural Networks (3D ConvNets) have been widely exploited for action recognition and have achieved satisfactory performance. However, discriminative action features are often drowned in a large amount of irrelevant information, which greatly increases the difficulty of video representation. To find a generic, cost-efficient approach that balances parameters and performance, we present a novel network that mines the Salient Spatio-Temporal Feature on a 3D ConvNets backbone for action recognition, termed S2TF-Net. First, we extract the salient features of each 3D residual block by constructing a multi-scale module for Salient Semantic Feature mining (SSF-Module). Then, to preserve the salient features through pooling operations, we establish a Two-branch Salient Feature Preserving Module (TSFP-Module). Moreover, these two modules, together with a suitable loss function, can be attached in an "easy-to-concat" fashion to most 3D ResNet backbones, yielding more accurate classification even with a shallower network. Finally, we conduct experiments on three popular action recognition datasets, where S2TF-Net is competitive with deeper 3D backbones and current state-of-the-art results. Taking P3D, 3D ResNet, Non-local I3D, and X3D as baselines, the proposed method improves all of them to varying degrees. In particular, for Non-local I3D ResNet, S2TF-Net improves accuracy by 4.1%, 3.0%, and 4.6% on the Kinetics-400, UCF101, and HMDB51 datasets, reaching 74.8%, 95.1%, and 80.9%, respectively. We hope this study will provide useful inspiration and experience for future research on more cost-effective methods. Code is released here: https://github.com/xiaoxiAries/S2TFNet.
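The abstract does not spell out the module internals, but the idea of multi-scale salient-feature mining per residual block can be sketched in PyTorch. The kernel sizes, the sigmoid gating, and the class name SSFModule below are illustrative assumptions, not the released implementation (see the GitHub link above for that).

```python
import torch
import torch.nn as nn


class SSFModule(nn.Module):
    """Hypothetical sketch of a multi-scale Salient Semantic Feature module.

    Parallel 3D convolutions with different receptive fields are fused into
    a saliency map that reweights the output of a 3D residual block. The
    scale choices and gating are assumptions, not the paper's exact design.
    """

    def __init__(self, channels: int, scales=(1, 3, 5)):
        super().__init__()
        # One branch per scale; padding keeps the (T, H, W) shape unchanged.
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=k, padding=k // 2)
            for k in scales
        )
        # Fuse the concatenated multi-scale responses into one saliency map.
        self.fuse = nn.Conv3d(channels * len(scales), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        saliency = torch.sigmoid(self.fuse(multi_scale))
        return x * saliency  # emphasize salient spatio-temporal positions
```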
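Similarly, "salient feature preserving" pooling can be approximated with a two-branch sketch: a max-pooling branch keeps peak responses while an average-pooling branch keeps context, and a learnable 1x1x1 convolution blends them. Again, this is a minimal sketch under assumed design choices, not the paper's exact TSFP-Module; the demo at the bottom also illustrates the "easy-to-concat" usage the abstract describes.

```python
import torch
import torch.nn as nn


class TSFPModule(nn.Module):
    """Hypothetical sketch of a Two-branch Salient Feature Preserving module.

    Max pooling preserves salient peaks, average pooling preserves context,
    and a 1x1x1 convolution learns how to blend the two branches.
    """

    def __init__(self, channels: int, stride: int = 2):
        super().__init__()
        self.max_branch = nn.MaxPool3d(kernel_size=stride, stride=stride)
        self.avg_branch = nn.AvgPool3d(kernel_size=stride, stride=stride)
        self.blend = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat([self.max_branch(x), self.avg_branch(x)], dim=1)
        return self.blend(pooled)


if __name__ == "__main__":
    # Toy "easy-to-concat" check on a clip batch of shape (N, C, T, H, W);
    # SSFModule refers to the sketch in the previous snippet.
    x = torch.randn(2, 64, 8, 28, 28)
    x = SSFModule(64)(x)    # mine salient features; shape is preserved
    y = TSFPModule(64)(x)   # downsample while preserving salient peaks
    print(y.shape)          # torch.Size([2, 64, 4, 14, 14])
```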
Journal description:
Signal Processing: Image Communication is an international journal for the development of the theory and practice of image communication. Its primary objectives are the following:
To present a forum for the advancement of theory and practice of image communication.
To stimulate cross-fertilization between related areas that have traditionally been treated separately, for example, various aspects of visual communications and information systems.
To contribute to a rapid information exchange between the industrial and academic environments.
The editorial policy and the technical content of the journal are the responsibility of the Editor-in-Chief, the Area Editors and the Advisory Editors. The Journal is self-supporting from subscription income and contains a minimum amount of advertisements. Advertisements are subject to the prior approval of the Editor-in-Chief. The journal welcomes contributions from every country in the world.
Signal Processing: Image Communication publishes articles relating to aspects of the design, implementation and use of image communication systems. The journal features original research work, tutorial and review articles, and accounts of practical developments.
Subjects of interest include image/video coding, 3D video representations and compression, 3D graphics and animation compression, HDTV and 3DTV systems, video adaptation, video over IP, peer-to-peer video networking, interactive visual communication, multi-user video conferencing, wireless video broadcasting and communication, visual surveillance, 2D and 3D image/video quality measures, pre/post processing, video restoration and super-resolution, multi-camera video analysis, motion analysis, content-based image/video indexing and retrieval, face and gesture processing, video synthesis, 2D and 3D image/video acquisition and display technologies, architectures for image/video processing and communication.