Combining spatio-temporal attention and multi-level feature fusion for video saliency prediction

IF 4.2 · Zone 3, Computer Science · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing · Pub Date: 2025-07-21 · DOI: 10.1016/j.imavis.2025.105678

Huiyu Luo

Volume 162, Article 105678. Full text: https://www.sciencedirect.com/science/article/pii/S0262885625002665

Citations: 0

Abstract

Recently, 3D convolution-based video saliency prediction models have adopted a fully convolutional encoder-decoder architecture to extract multi-level spatio-temporal features and have achieved impressive performance. Deep-level features carry semantic information that reflects salient regions, while shallow-level features contain fine detail. However, these models have two issues: they fail to capture global information, and their equally weighted fusion mechanism ignores the differences between deep and shallow features. To address these issues, we propose a novel model that combines spatio-temporal attention and multi-level feature fusion, with two main components: the global spatio-temporal correlation (GSC) structure and the attention-guided fusion (AGF) module. The GSC structure employs the Video Swin Transformer to capture global spatio-temporal correlations from the deepest local spatio-temporal features through the multi-head attention mechanism. In place of equally weighted fusion, the proposed AGF module adaptively computes an attention map from deep-level features alone, through spatio-temporal attention and channel attention branches; this map guides the features to focus on salient regions as they are fused. Extensive experiments on four datasets demonstrate that the proposed model achieves performance comparable to state-of-the-art models and confirm the effectiveness of each of its components.
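The abstract does not give the AGF equations, so the following is only a minimal NumPy sketch of the general idea: an attention map is derived from the deep features alone, by a spatio-temporal branch and a channel branch, and is then used to weight the fusion instead of adding the two levels equally. Everything concrete here is an assumption made for illustration: the function names, the `(C, T, H, W)` layout, the channel mean standing in for a learned 3D convolution, the single linear map `w_channel`, and the choice to apply the attention to the shallow features before addition. It is not the paper's implementation.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def attention_guided_fusion(deep, shallow, w_channel):
    """Fuse deep and shallow features, both of shape (C, T, H, W).

    Hypothetical sketch: the attention map depends on `deep` only.
    """
    c = deep.shape[0]

    # Spatio-temporal branch: one weight per (t, h, w) location.
    # Assumption: a channel mean stands in for a learned 3D convolution.
    st_att = sigmoid(deep.mean(axis=0, keepdims=True))        # (1, T, H, W)

    # Channel branch: one weight per channel, from global average
    # pooling followed by a linear map w_channel of shape (C, C).
    pooled = deep.mean(axis=(1, 2, 3))                        # (C,)
    ch_att = sigmoid(w_channel @ pooled).reshape(c, 1, 1, 1)  # (C, 1, 1, 1)

    # Broadcasting combines the two branches into a full attention map.
    att = st_att * ch_att                                     # (C, T, H, W)

    # Assumption: the map re-weights the shallow features before addition.
    fused = deep + att * shallow
    return fused, att


def equal_weight_fusion(deep, shallow):
    """Baseline: both feature levels contribute with the same weight."""
    return deep + shallow


rng = np.random.default_rng(0)
C, T, H, W = 4, 2, 3, 3
deep = rng.standard_normal((C, T, H, W))
shallow = rng.standard_normal((C, T, H, W))
w_channel = rng.standard_normal((C, C))  # stands in for learned weights

fused, att = attention_guided_fusion(deep, shallow, w_channel)
baseline = equal_weight_fusion(deep, shallow)
```

In this sketch every attention weight lies strictly between 0 and 1, so the shallow features are attenuated location by location and channel by channel, whereas `equal_weight_fusion` passes them through unchanged. The sketch also assumes the shallow features have already been resized and projected to the deep features' shape.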


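Source journal

Image and Vision Computing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months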

Journal description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.