{"title":"Event-level multimodal feature fusion for audio–visual event localization","authors":"Jing Zhang, Yi Yu, Yuyao Mao, Yonggong Ren","doi":"10.1016/j.imavis.2025.105610","DOIUrl":null,"url":null,"abstract":"<div><div>Audio–visual event localization, by identifying audio–visual segments most relevant to semantics from long video sequences, has become a crucial prerequisite for applications such as video content understanding and editing. Albeit the rich visual and auditory information in video data greatly enhances the accuracy of event localization models, the challenges, however, such as visual ambiguity, occlusion, small-scale targets, and sparse auditory features, hinder the acquisition of temporally continuous video segments with semantic consistency for events. To this end, we propose an event localization model with event-level multimodal feature fusion strategy to encode the event semantics consistency from video data, thereby improving the event localization accuracy. In particular, a multimodal features and distribution consistency loss is devised to train the spatial attention based architecture, along with the supervised loss, to fuse the multi-modal attended features to achieve the semantic consistency. To further mitigate the detrimental impact of outliers, i.e., the segments with non-relevant semantics, we propose to learn adaptive continuity sampling parameters to construct segment content sets with consistent semantics to the video, the experimental results demonstrate the advantages of our model against the existing event localization counterparts.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"161 ","pages":"Article 105610"},"PeriodicalIF":4.2000,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625001982","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Audio–visual event localization, which identifies the audio–visual segments most relevant to an event's semantics in long video sequences, has become a crucial prerequisite for applications such as video content understanding and editing. Although the rich visual and auditory information in video data greatly enhances the accuracy of event localization models, challenges such as visual ambiguity, occlusion, small-scale targets, and sparse auditory features hinder the acquisition of temporally continuous video segments that are semantically consistent with the event. To this end, we propose an event localization model with an event-level multimodal feature fusion strategy that encodes event-level semantic consistency from video data, thereby improving localization accuracy. In particular, a multimodal feature and distribution consistency loss is devised to train the spatial-attention-based architecture, alongside the supervised loss, so that the attended multimodal features are fused into a semantically consistent representation. To further mitigate the detrimental impact of outliers, i.e., segments with irrelevant semantics, we learn adaptive continuity sampling parameters to construct segment content sets whose semantics are consistent with the video. Experimental results demonstrate the advantages of our model over existing event localization counterparts.
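As a rough illustration of the kind of training objective the abstract describes, the PyTorch sketch below fuses audio and visual segment features with cross-modal attention and combines a supervised segment-level event loss with a feature/distribution consistency term. This is a minimal sketch under stated assumptions, not the authors' released code: the module names, feature dimensions, attention layout, the cosine-plus-KL form of the consistency loss, and the loss weight are all illustrative choices.

```python
# Hypothetical sketch of attention-based audio-visual fusion trained with a
# supervised loss plus a feature/distribution consistency term. All names,
# dimensions, and loss forms are assumptions, not the paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionFusion(nn.Module):
    """Cross-modal attention: visual segments attended by audio, then fused."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis, aud: (batch, segments, dim)
        attended, _ = self.attn(query=aud, key=vis, value=vis)
        return self.fuse(torch.cat([attended, aud], dim=-1))


def consistency_loss(vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
    """Feature + distribution consistency between the two modalities.

    A cosine term pulls per-segment features together; a symmetric KL term on
    softmax-normalised features encourages similar feature distributions.
    """
    feat_term = 1.0 - F.cosine_similarity(vis, aud, dim=-1).mean()
    p, q = F.log_softmax(vis, dim=-1), F.log_softmax(aud, dim=-1)
    dist_term = 0.5 * (
        F.kl_div(p, q, log_target=True, reduction="batchmean")
        + F.kl_div(q, p, log_target=True, reduction="batchmean")
    )
    return feat_term + dist_term


# Usage sketch: combine with the supervised per-segment event classification loss.
if __name__ == "__main__":
    B, T, D, num_events = 2, 10, 256, 29           # toy sizes
    vis = torch.randn(B, T, D)                     # visual segment features
    aud = torch.randn(B, T, D)                     # audio segment features
    labels = torch.randint(0, num_events, (B, T))  # per-segment event labels

    fusion = AttentionFusion(D)
    classifier = nn.Linear(D, num_events)

    fused = fusion(vis, aud)                       # (B, T, D)
    logits = classifier(fused)                     # (B, T, num_events)

    sup = F.cross_entropy(logits.reshape(-1, num_events), labels.reshape(-1))
    cons = consistency_loss(vis, aud)
    total = sup + 0.1 * cons                       # weighting is an assumption
    print(float(total))
```

The design point the sketch is meant to convey is the joint objective: the supervised term anchors segment-level event labels, while the consistency term regularizes the audio and visual streams toward agreeing semantics before fusion.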
Journal introduction:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.