{"title":"Event-level multimodal feature fusion for audio–visual event localization","authors":"Jing Zhang, Yi Yu, Yuyao Mao, Yonggong Ren","doi":"10.1016/j.imavis.2025.105610","DOIUrl":null,"url":null,"abstract":"<div><div>Audio–visual event localization, by identifying audio–visual segments most relevant to semantics from long video sequences, has become a crucial prerequisite for applications such as video content understanding and editing. Albeit the rich visual and auditory information in video data greatly enhances the accuracy of event localization models, the challenges, however, such as visual ambiguity, occlusion, small-scale targets, and sparse auditory features, hinder the acquisition of temporally continuous video segments with semantic consistency for events. To this end, we propose an event localization model with event-level multimodal feature fusion strategy to encode the event semantics consistency from video data, thereby improving the event localization accuracy. In particular, a multimodal features and distribution consistency loss is devised to train the spatial attention based architecture, along with the supervised loss, to fuse the multi-modal attended features to achieve the semantic consistency. To further mitigate the detrimental impact of outliers, i.e., the segments with non-relevant semantics, we propose to learn adaptive continuity sampling parameters to construct segment content sets with consistent semantics to the video, the experimental results demonstrate the advantages of our model against the existing event localization counterparts.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"161 ","pages":"Article 105610"},"PeriodicalIF":4.2000,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625001982","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Audio–visual event localization, which identifies the audio–visual segments most relevant to an event's semantics in long video sequences, has become a crucial prerequisite for applications such as video content understanding and editing. Although the rich visual and auditory information in video data greatly enhances the accuracy of event localization models, challenges such as visual ambiguity, occlusion, small-scale targets, and sparse auditory features hinder the acquisition of temporally continuous video segments that are semantically consistent with the event. To this end, we propose an event localization model with an event-level multimodal feature fusion strategy that encodes event-level semantic consistency from video data, thereby improving localization accuracy. In particular, a multimodal feature and distribution consistency loss is devised to train the spatial-attention-based architecture, alongside the supervised loss, so that the attended multimodal features are fused into a semantically consistent representation. To further mitigate the detrimental impact of outliers, i.e., segments with irrelevant semantics, we learn adaptive continuity sampling parameters to construct segment content sets whose semantics are consistent with the video. Experimental results demonstrate the advantages of our model over existing event localization counterparts.
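As a rough illustration of the kind of training objective the abstract describes, the PyTorch sketch below fuses audio and visual segment features with cross-modal attention and combines a supervised segment-level event loss with a feature/distribution consistency term. This is a minimal sketch under stated assumptions, not the authors' released code: the module names, feature dimensions, attention layout, the cosine-plus-KL form of the consistency loss, and the loss weight are all illustrative choices.

```python
# Hypothetical sketch of attention-based audio-visual fusion trained with a
# supervised loss plus a feature/distribution consistency term. All names,
# dimensions, and loss forms are assumptions, not the paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionFusion(nn.Module):
    """Cross-modal attention: visual segments attended by audio, then fused."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis, aud: (batch, segments, dim)
        attended, _ = self.attn(query=aud, key=vis, value=vis)
        return self.fuse(torch.cat([attended, aud], dim=-1))


def consistency_loss(vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
    """Feature + distribution consistency between the two modalities.

    A cosine term pulls per-segment features together; a symmetric KL term on
    softmax-normalised features encourages similar feature distributions.
    """
    feat_term = 1.0 - F.cosine_similarity(vis, aud, dim=-1).mean()
    p, q = F.log_softmax(vis, dim=-1), F.log_softmax(aud, dim=-1)
    dist_term = 0.5 * (
        F.kl_div(p, q, log_target=True, reduction="batchmean")
        + F.kl_div(q, p, log_target=True, reduction="batchmean")
    )
    return feat_term + dist_term


# Usage sketch: combine with the supervised per-segment event classification loss.
if __name__ == "__main__":
    B, T, D, num_events = 2, 10, 256, 29           # toy sizes
    vis = torch.randn(B, T, D)                     # visual segment features
    aud = torch.randn(B, T, D)                     # audio segment features
    labels = torch.randint(0, num_events, (B, T))  # per-segment event labels

    fusion = AttentionFusion(D)
    classifier = nn.Linear(D, num_events)

    fused = fusion(vis, aud)                       # (B, T, D)
    logits = classifier(fused)                     # (B, T, num_events)

    sup = F.cross_entropy(logits.reshape(-1, num_events), labels.reshape(-1))
    cons = consistency_loss(vis, aud)
    total = sup + 0.1 * cons                       # weighting is an assumption
    print(float(total))
```

The design point the sketch is meant to convey is the joint objective: the supervised term anchors segment-level event labels, while the consistency term regularizes the audio and visual streams toward agreeing semantics before fusion.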
Journal introduction:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.