Pufen Zhang, Jiaxiang Wang, Meng Wan, Song Zhang, Jie Jing, Lianhong Ding, Peng Shi
Complex & Intelligent Systems. Published 2024-11-09. DOI: 10.1007/s40747-024-01654-2 (https://doi.org/10.1007/s40747-024-01654-2)
Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging
The audio-visual event localization (AVEL) task aims to identify and classify events that are both audible and visible. Existing methods pursue this goal by transferring pre-trained knowledge and by modeling the temporal dependencies and cross-modal correlations of the audio-visual scene. However, most works interpret the audio-visual scene from a single, entangled temporal-aware perspective and neglect to learn temporal dependencies and cross-modal correlations in both forward and backward temporal-aware views. Recently, transferring pre-trained knowledge from the Contrastive Language-Image Pre-training (CLIP) model has shown remarkable results across various tasks. Nevertheless, because a heterogeneous gap exists between the audio-visual knowledge of the AVEL task and the image-text alignment knowledge of CLIP, how to transfer CLIP's image-text alignment knowledge to the AVEL field has barely been investigated. To address these challenges, this paper proposes a novel Dual Temporal-aware scene understanding and image-text Knowledge Bridging (DTKB) model. DTKB consists of forward and backward temporal-aware scene understanding streams, in which temporal dependencies and cross-modal correlations are explicitly captured from dual temporal-aware perspectives. Consequently, DTKB can achieve fine-grained scene understanding for event localization. Additionally, a knowledge bridging (KB) module is proposed to transfer both the image-text representation knowledge and the alignment knowledge of CLIP to the AVEL task. This module regulates the ratio between the audio-visual fusion features and CLIP's visual features, thereby bridging CLIP's image-text alignment knowledge and the newly learned audio-visual knowledge for event category prediction. The KB module is also compatible with previous models. Extensive experimental results demonstrate that DTKB significantly outperforms state-of-the-art models.
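The abstract does not give the model's equations, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes that a "forward" view lets each video segment attend to itself and earlier segments while a "backward" view lets it attend to itself and later segments, and that the ratio regulation is a simple convex blend of the two feature vectors scored by cosine similarity against per-category text features. The function names (`temporal_mask`, `bridge`, `predict_category`) and the scalar `ratio` parameter are hypothetical and are not taken from the paper.

```python
import math


def temporal_mask(num_segments, direction):
    """Return mask[t][s], True when segment t may attend to segment s.

    Assumption: 'forward' sees the current and earlier segments,
    'backward' sees the current and later segments.
    """
    if direction == "forward":
        return [[s <= t for s in range(num_segments)] for t in range(num_segments)]
    if direction == "backward":
        return [[s >= t for s in range(num_segments)] for t in range(num_segments)]
    raise ValueError("direction must be 'forward' or 'backward'")


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def bridge(av_feature, clip_visual_feature, ratio):
    """Blend the two features as ratio * av + (1 - ratio) * clip.

    Assumption: a convex combination stands in for the paper's
    ratio regulation, whose exact form the abstract does not state.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    if len(av_feature) != len(clip_visual_feature):
        raise ValueError("features must have the same dimension")
    return [ratio * a + (1.0 - ratio) * c
            for a, c in zip(av_feature, clip_visual_feature)]


def predict_category(av_feature, clip_visual_feature, text_features, ratio):
    """Score the blended feature against each category's text feature.

    Returns (index of the best category, list of cosine similarities).
    """
    blended = l2_normalize(bridge(av_feature, clip_visual_feature, ratio))
    scores = [
        sum(b * t for b, t in zip(blended, l2_normalize(text)))
        for text in text_features
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

In this sketch a small `ratio` keeps the prediction close to CLIP's image-text alignment, while a larger one gives more weight to the learned audio-visual fusion feature. A learned or input-dependent ratio would be a natural variation, but that too would be an assumption about the paper's design.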
About the journal:
Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.