From Discrete Representation to Continuous Modeling: A Novel Audio-Visual Saliency Prediction Model With Implicit Neural Representations

IF 5.3 · CAS Tier 3 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Dandan Zhu;Kaiwei Zhang;Kun Zhu;Nana Zhang;Weiping Ding;Guangtao Zhai;Xiaokang Yang
{"title":"从离散表征到连续建模:具有内隐神经表征的新型视听显著性预测模型","authors":"Dandan Zhu;Kaiwei Zhang;Kun Zhu;Nana Zhang;Weiping Ding;Guangtao Zhai;Xiaokang Yang","doi":"10.1109/TETCI.2024.3386619","DOIUrl":null,"url":null,"abstract":"In the era of deep learning, audio-visual saliency prediction is still in its infancy due to the complexity of video signals and the continuous correlation in the temporal dimension. Most existing approaches treat videos as 3D grids of RGB values and model them using discrete neural networks, leading to issues such as video content-agnostic and sub-optimal feature representation ability. To address these challenges, we propose a novel dynamic-aware audio-visual saliency (DAVS) model based on implicit neural representations (INRs). The core of our proposed DAVS model is to build an effective mapping by exploiting a parametric neural network that maps space-time coordinates to the corresponding saliency values. Specifically, our model incorporates an INR-based video generator that decomposes videos into image, motion, and audio feature vectors, learning video content-adaptive features via a parametric neural network. This generator efficiently encodes videos, naturally models continuous temporal dynamics, and enhances feature representation capability. Furthermore, we introduce a parametric audio-visual feature fusion strategy in the saliency prediction procedure, enabling intrinsic interactions between modalities and adaptively integrating visual and audio cues. Through extensive experiments on benchmark datasets, our proposed DAVS model demonstrates promising performance and intriguing properties in audio-visual saliency prediction.","PeriodicalId":13135,"journal":{"name":"IEEE Transactions on Emerging Topics in Computational Intelligence","volume":"8 6","pages":"4059-4074"},"PeriodicalIF":5.3000,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"From Discrete Representation to Continuous Modeling: A Novel Audio-Visual Saliency Prediction Model With Implicit Neural Representations\",\"authors\":\"Dandan Zhu;Kaiwei Zhang;Kun Zhu;Nana Zhang;Weiping Ding;Guangtao Zhai;Xiaokang Yang\",\"doi\":\"10.1109/TETCI.2024.3386619\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In the era of deep learning, audio-visual saliency prediction is still in its infancy due to the complexity of video signals and the continuous correlation in the temporal dimension. Most existing approaches treat videos as 3D grids of RGB values and model them using discrete neural networks, leading to issues such as video content-agnostic and sub-optimal feature representation ability. To address these challenges, we propose a novel dynamic-aware audio-visual saliency (DAVS) model based on implicit neural representations (INRs). The core of our proposed DAVS model is to build an effective mapping by exploiting a parametric neural network that maps space-time coordinates to the corresponding saliency values. Specifically, our model incorporates an INR-based video generator that decomposes videos into image, motion, and audio feature vectors, learning video content-adaptive features via a parametric neural network. This generator efficiently encodes videos, naturally models continuous temporal dynamics, and enhances feature representation capability. 
Furthermore, we introduce a parametric audio-visual feature fusion strategy in the saliency prediction procedure, enabling intrinsic interactions between modalities and adaptively integrating visual and audio cues. Through extensive experiments on benchmark datasets, our proposed DAVS model demonstrates promising performance and intriguing properties in audio-visual saliency prediction.\",\"PeriodicalId\":13135,\"journal\":{\"name\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"volume\":\"8 6\",\"pages\":\"4059-4074\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2024-04-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10502245/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10502245/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

In the era of deep learning, audio-visual saliency prediction is still in its infancy due to the complexity of video signals and the continuous correlation in the temporal dimension. Most existing approaches treat videos as 3D grids of RGB values and model them using discrete neural networks, leading to issues such as video-content-agnostic modeling and sub-optimal feature representation. To address these challenges, we propose a novel dynamic-aware audio-visual saliency (DAVS) model based on implicit neural representations (INRs). The core of our proposed DAVS model is to build an effective mapping by exploiting a parametric neural network that maps space-time coordinates to the corresponding saliency values. Specifically, our model incorporates an INR-based video generator that decomposes videos into image, motion, and audio feature vectors, learning video content-adaptive features via a parametric neural network. This generator efficiently encodes videos, naturally models continuous temporal dynamics, and enhances feature representation capability. Furthermore, we introduce a parametric audio-visual feature fusion strategy in the saliency prediction procedure, enabling intrinsic interactions between modalities and adaptively integrating visual and audio cues. Through extensive experiments on benchmark datasets, our proposed DAVS model demonstrates promising performance and intriguing properties in audio-visual saliency prediction.
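To make the coordinate-to-saliency idea concrete, below is a minimal sketch of an implicit neural representation that maps normalized space-time coordinates (x, y, t), conditioned on a fused audio-visual feature vector, to a saliency value. This is not the authors' DAVS implementation; the layer sizes, activations, and conditioning interface are illustrative assumptions only.

```python
# Minimal illustrative sketch: an implicit neural representation (INR) that maps
# normalized space-time coordinates to saliency values, conditioned on a fused
# audio-visual feature vector. NOT the paper's DAVS model; all layer sizes,
# activations, and the conditioning scheme are assumptions for illustration.
import torch
import torch.nn as nn


class SaliencyINR(nn.Module):
    def __init__(self, cond_dim: int = 256, hidden: int = 256, depth: int = 4):
        super().__init__()
        dims = [3 + cond_dim] + [hidden] * depth + [1]
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                layers.append(nn.ReLU())
        self.mlp = nn.Sequential(*layers)

    def forward(self, coords: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) normalized (x, y, t) in [0, 1]
        # cond:   (N, cond_dim) fused audio-visual feature vector
        out = self.mlp(torch.cat([coords, cond], dim=-1))
        return torch.sigmoid(out)  # saliency value in [0, 1] per query point


# Usage: because the learned mapping is continuous, saliency can be queried at
# arbitrary space-time locations rather than only at fixed pixel/frame positions.
model = SaliencyINR()
coords = torch.rand(1024, 3)       # 1024 random (x, y, t) query points
cond = torch.randn(1024, 256)      # placeholder fused audio-visual features
saliency = model(coords, cond)     # shape (1024, 1)
```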
Source Journal: IEEE Transactions on Emerging Topics in Computational Intelligence
CiteScore: 10.30
Self-citation rate: 7.50%
Articles published: 147
Journal description: The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys. TETCI is an electronics only publication. TETCI publishes six issues per year. Authors are encouraged to submit manuscripts in any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. A few such illustrative examples are glial cell networks, computational neuroscience, Brain Computer Interface, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, computational intelligence for the IoT and Smart-X technologies.