Dandan Zhu;Kaiwei Zhang;Kun Zhu;Nana Zhang;Weiping Ding;Guangtao Zhai;Xiaokang Yang
{"title":"从离散表征到连续建模:具有内隐神经表征的新型视听显著性预测模型","authors":"Dandan Zhu;Kaiwei Zhang;Kun Zhu;Nana Zhang;Weiping Ding;Guangtao Zhai;Xiaokang Yang","doi":"10.1109/TETCI.2024.3386619","DOIUrl":null,"url":null,"abstract":"In the era of deep learning, audio-visual saliency prediction is still in its infancy due to the complexity of video signals and the continuous correlation in the temporal dimension. Most existing approaches treat videos as 3D grids of RGB values and model them using discrete neural networks, leading to issues such as video content-agnostic and sub-optimal feature representation ability. To address these challenges, we propose a novel dynamic-aware audio-visual saliency (DAVS) model based on implicit neural representations (INRs). The core of our proposed DAVS model is to build an effective mapping by exploiting a parametric neural network that maps space-time coordinates to the corresponding saliency values. Specifically, our model incorporates an INR-based video generator that decomposes videos into image, motion, and audio feature vectors, learning video content-adaptive features via a parametric neural network. This generator efficiently encodes videos, naturally models continuous temporal dynamics, and enhances feature representation capability. Furthermore, we introduce a parametric audio-visual feature fusion strategy in the saliency prediction procedure, enabling intrinsic interactions between modalities and adaptively integrating visual and audio cues. Through extensive experiments on benchmark datasets, our proposed DAVS model demonstrates promising performance and intriguing properties in audio-visual saliency prediction.","PeriodicalId":13135,"journal":{"name":"IEEE Transactions on Emerging Topics in Computational Intelligence","volume":"8 6","pages":"4059-4074"},"PeriodicalIF":5.3000,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"From Discrete Representation to Continuous Modeling: A Novel Audio-Visual Saliency Prediction Model With Implicit Neural Representations\",\"authors\":\"Dandan Zhu;Kaiwei Zhang;Kun Zhu;Nana Zhang;Weiping Ding;Guangtao Zhai;Xiaokang Yang\",\"doi\":\"10.1109/TETCI.2024.3386619\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In the era of deep learning, audio-visual saliency prediction is still in its infancy due to the complexity of video signals and the continuous correlation in the temporal dimension. Most existing approaches treat videos as 3D grids of RGB values and model them using discrete neural networks, leading to issues such as video content-agnostic and sub-optimal feature representation ability. To address these challenges, we propose a novel dynamic-aware audio-visual saliency (DAVS) model based on implicit neural representations (INRs). The core of our proposed DAVS model is to build an effective mapping by exploiting a parametric neural network that maps space-time coordinates to the corresponding saliency values. Specifically, our model incorporates an INR-based video generator that decomposes videos into image, motion, and audio feature vectors, learning video content-adaptive features via a parametric neural network. This generator efficiently encodes videos, naturally models continuous temporal dynamics, and enhances feature representation capability. 
Furthermore, we introduce a parametric audio-visual feature fusion strategy in the saliency prediction procedure, enabling intrinsic interactions between modalities and adaptively integrating visual and audio cues. Through extensive experiments on benchmark datasets, our proposed DAVS model demonstrates promising performance and intriguing properties in audio-visual saliency prediction.\",\"PeriodicalId\":13135,\"journal\":{\"name\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"volume\":\"8 6\",\"pages\":\"4059-4074\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2024-04-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10502245/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10502245/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
From Discrete Representation to Continuous Modeling: A Novel Audio-Visual Saliency Prediction Model With Implicit Neural Representations
In the era of deep learning, audio-visual saliency prediction is still in its infancy due to the complexity of video signals and their continuous correlation along the temporal dimension. Most existing approaches treat videos as 3D grids of RGB values and model them with discrete neural networks, leading to issues such as content-agnostic modeling and sub-optimal feature representation. To address these challenges, we propose a novel dynamic-aware audio-visual saliency (DAVS) model based on implicit neural representations (INRs). The core of the DAVS model is a parametric neural network that maps space-time coordinates to the corresponding saliency values. Specifically, our model incorporates an INR-based video generator that decomposes videos into image, motion, and audio feature vectors, learning video content-adaptive features via a parametric neural network. This generator encodes videos efficiently, naturally models continuous temporal dynamics, and enhances feature representation capability. Furthermore, we introduce a parametric audio-visual feature fusion strategy in the saliency prediction procedure, enabling intrinsic interactions between modalities and adaptively integrating visual and audio cues. Through extensive experiments on benchmark datasets, our proposed DAVS model demonstrates promising performance and intriguing properties in audio-visual saliency prediction.
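To make the coordinate-to-saliency mapping concrete, below is a minimal, illustrative sketch of an implicit neural representation that maps normalized space-time coordinates (x, y, t) to saliency values. It is not the authors' implementation: the class name, the Fourier-feature encoding, the layer sizes, and the sigmoid output are assumptions chosen only to show the idea of replacing a discrete 3D grid of pixels with a continuous parametric mapping.

```python
# Illustrative sketch only; names and hyperparameters are assumptions,
# not details taken from the paper.
import torch
import torch.nn as nn


class CoordinateSaliencyINR(nn.Module):
    """Maps normalized (x, y, t) coordinates in [0, 1]^3 to scalar saliency values."""

    def __init__(self, num_frequencies: int = 8, hidden_dim: int = 256):
        super().__init__()
        self.num_frequencies = num_frequencies
        in_dim = 3 * 2 * num_frequencies  # sin and cos per frequency, per coordinate
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),  # saliency value in [0, 1]
        )

    def positional_encoding(self, coords: torch.Tensor) -> torch.Tensor:
        # Fourier features help the MLP represent high-frequency spatial and temporal detail.
        freqs = 2.0 ** torch.arange(
            self.num_frequencies, device=coords.device, dtype=coords.dtype
        )
        angles = coords.unsqueeze(-1) * freqs * torch.pi              # (N, 3, F)
        enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (N, 3, 2F)
        return enc.flatten(start_dim=1)                               # (N, 3 * 2F)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) batch of (x, y, t) query points.
        return self.mlp(self.positional_encoding(coords))


# Usage: query saliency at arbitrary continuous space-time points.
model = CoordinateSaliencyINR()
coords = torch.rand(1024, 3)   # 1024 random (x, y, t) queries in [0, 1]^3
saliency = model(coords)       # (1024, 1) predicted saliency values
```

Because such a network is queried at arbitrary real-valued coordinates rather than fixed frame indices, temporal dynamics are modeled continuously, which is the property the abstract attributes to the INR-based video generator.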
Journal description:
The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys.
TETCI is an electronic-only publication and publishes six issues per year.
Authors are encouraged to submit manuscripts on any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. A few illustrative examples are glial cell networks, computational neuroscience, brain-computer interfaces, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, and computational intelligence for IoT and Smart-X technologies.