Title: PredVSD: Video saliency prediction based on conditional diffusion model
Authors: Chenming Li, Shiguang Liu
Journal: Knowledge-Based Systems, Volume 324, Article 113820 (Q1, Impact Factor 7.2)
DOI: 10.1016/j.knosys.2025.113820
Published: 2025-06-13
URL: https://www.sciencedirect.com/science/article/pii/S0950705125008664
Citations: 0
Abstract
Mainstream deep learning methods for video saliency prediction often use 3D CNNs or Vision Transformers as encoder–decoders, relying on task-specific loss functions to implicitly map input frames to saliency maps. However, these methods are limited by their capacity for salient feature expression. In this study, inspired by the recent advances of diffusion models in video processing tasks, we propose a Conditional Diffusion Model for Video Saliency Prediction (PredVSD), which leverages semantic video features and saliency-specific encodings as conditions to capture more representative saliency features from the target data distribution. To effectively integrate multi-scale visual features and saliency priors, we design an auxiliary network, Saliency-PyramidU-Net, allowing the denoising process to focus more on salient regions across the spatial–temporal plane. Extensive experiments confirm PredVSD’s strong performance across visual and audio-visual datasets.
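The abstract frames saliency prediction as conditional denoising: a clean saliency map is progressively noised, and a network conditioned on video features learns to reverse the process. The paper's actual architecture (Saliency-PyramidU-Net and its conditioning scheme) is not specified here, so the following is only a minimal numpy sketch of the generic conditional DDPM mechanics such a model builds on; `eps_hat` stands in for the noise prediction a real conditioned denoiser would produce.

```python
import numpy as np

# Illustrative sketch of the diffusion process underlying conditional
# saliency prediction; schedule values and shapes are assumptions, not
# the paper's settings.

T = 100
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative signal retention

def q_sample(x0, t, noise):
    """Forward process: noise a clean saliency map x0 to timestep t."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def ddpm_step(x_t, t, eps_hat, rng):
    """One reverse (denoising) step given a predicted noise eps_hat.
    In PredVSD-style models, eps_hat would come from a network
    conditioned on video features; here it is supplied directly."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

# Toy demo: an 8x8 "saliency map" for a single frame.
rng = np.random.default_rng(0)
x0 = rng.random((8, 8))
noise = rng.standard_normal((8, 8))
x_t = q_sample(x0, T - 1, noise)     # fully noised map
eps_hat = noise                      # oracle prediction, for the demo only
x_prev = ddpm_step(x_t, T - 1, eps_hat, rng)
```

At inference, this reverse step is iterated from pure noise down to t = 0, with the conditioning (video semantics and saliency priors, per the abstract) steering each noise prediction toward salient regions.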
About the journal:
Knowledge-Based Systems is an international, interdisciplinary journal in artificial intelligence that publishes original, innovative, and creative research. It focuses on systems built with knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computational techniques, to balance theoretical and practical studies, and to encourage the development and implementation of knowledge-based intelligent models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.