{"title":"Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization","authors":"Ling Xing, Hongyu Qu, Rui Yan, Xiangbo Shu, Jinhui Tang","doi":"arxiv-2409.07967","DOIUrl":null,"url":null,"abstract":"Dense-localization Audio-Visual Events (DAVE) aims to identify time\nboundaries and corresponding categories for events that can be heard and seen\nconcurrently in an untrimmed video. Existing methods typically encode audio and\nvisual representation separately without any explicit cross-modal alignment\nconstraint. Then they adopt dense cross-modal attention to integrate multimodal\ninformation for DAVE. Thus these methods inevitably aggregate irrelevant noise\nand events, especially in complex and long videos, leading to imprecise\ndetection. In this paper, we present LOCO, a Locality-aware cross-modal\nCorrespondence learning framework for DAVE. The core idea is to explore local\ntemporal continuity nature of audio-visual events, which serves as informative\nyet free supervision signals to guide the filtering of irrelevant information\nand inspire the extraction of complementary multimodal information during both\nunimodal and cross-modal learning stages. i) Specifically, LOCO applies\nLocality-aware Correspondence Correction (LCC) to uni-modal features via\nleveraging cross-modal local-correlated properties without any extra\nannotations. This enforces uni-modal encoders to highlight similar semantics\nshared by audio and visual features. ii) To better aggregate such audio and\nvisual features, we further customize Cross-modal Dynamic Perception layer\n(CDP) in cross-modal feature pyramid to understand local temporal patterns of\naudio-visual events by imposing local consistency within multimodal features in\na data-driven manner. By incorporating LCC and CDP, LOCO provides solid\nperformance gains and outperforms existing methods for DAVE. The source code\nwill be released.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07967","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Dense-localization of Audio-Visual Events (DAVE) aims to identify the temporal boundaries and corresponding categories of events that can be heard and seen concurrently in an untrimmed video. Existing methods typically encode audio and visual representations separately, without any explicit cross-modal alignment constraint, and then adopt dense cross-modal attention to integrate multimodal information for DAVE. These methods therefore inevitably aggregate irrelevant noise and events, especially in complex and long videos, leading to imprecise detection. In this paper, we present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE. The core idea is to exploit the local temporal continuity of audio-visual events, which serves as an informative yet free supervision signal to guide the filtering of irrelevant information and to encourage the extraction of complementary multimodal information during both the unimodal and cross-modal learning stages. i) Specifically, LOCO applies Locality-aware Correspondence Correction (LCC) to unimodal features by leveraging the locally correlated properties of the two modalities, without any extra annotations; this encourages the unimodal encoders to highlight the semantics shared by audio and visual features. ii) To better aggregate such audio and visual features, we further customize a Cross-modal Dynamic Perception (CDP) layer within the cross-modal feature pyramid, which captures the local temporal patterns of audio-visual events by imposing local consistency on multimodal features in a data-driven manner. By incorporating LCC and CDP, LOCO delivers solid performance gains and outperforms existing methods for DAVE. The source code will be released.
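
To make the LCC idea concrete, here is a minimal sketch of what a locality-aware correspondence objective could look like: each audio frame is pulled toward visual frames inside a small temporal window and pushed away from frames outside it, using only temporal position as free supervision. The abstract gives no implementation details, so the function name, the InfoNCE-style form, the window size, and the temperature below are all assumptions for illustration, not the paper's actual formulation.

```python
# Illustrative sketch only: a locality-aware cross-modal correspondence loss.
# Every design choice here is an assumption; the paper's LCC may differ.
import torch
import torch.nn.functional as F

def locality_aware_correspondence_loss(audio, visual, window=2, temperature=0.07):
    """audio, visual: (T, D) temporally aligned unimodal features of one video.

    For each audio time step t, treat visual frames within +/- `window` as
    positives (local temporal continuity) and all other frames as negatives.
    """
    audio = F.normalize(audio, dim=-1)
    visual = F.normalize(visual, dim=-1)
    T = audio.size(0)

    # (T, T) cross-modal similarity between audio step t and visual step t'.
    sim = audio @ visual.t() / temperature

    # Locality mask: entry [t, t'] is True iff |t - t'| <= window.
    idx = torch.arange(T)
    pos_mask = (idx[None, :] - idx[:, None]).abs() <= window

    # InfoNCE over each row: maximize log-prob mass on the local window.
    log_prob = sim.log_softmax(dim=-1)
    loss = -(log_prob.masked_fill(~pos_mask, 0).sum(dim=-1)
             / pos_mask.sum(dim=-1)).mean()
    return loss

if __name__ == "__main__":
    a = torch.randn(60, 256)  # 60 one-second audio features
    v = torch.randn(60, 256)  # 60 one-second visual features
    print(locality_aware_correspondence_loss(a, v).item())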
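Likewise, a rough sketch of a CDP-style layer: fused audio-visual features are aggregated over a local temporal window with input-conditioned (data-driven) weights, so each time step mixes its neighborhood with its own predicted kernel. This is an illustrative guess at the mechanism the abstract describes; the actual CDP layer, and its placement in the cross-modal feature pyramid, is likely more elaborate, and all names and shapes below are hypothetical.

```python
# Illustrative sketch only: data-driven local temporal aggregation of fused
# audio-visual features. The paper's CDP layer may be structured differently.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicLocalPerception(nn.Module):
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernel assumed for same-length output"
        self.kernel_size = kernel_size
        self.fuse = nn.Linear(2 * dim, dim)            # simple A/V fusion
        self.kernel_gen = nn.Linear(dim, kernel_size)  # per-step local weights

    def forward(self, audio, visual):
        """audio, visual: (B, T, D) temporally aligned features."""
        x = self.fuse(torch.cat([audio, visual], dim=-1))  # (B, T, D)

        # Predict a normalized local mixing kernel for every time step.
        w = self.kernel_gen(x).softmax(dim=-1)             # (B, T, K)

        # Gather the K-sized temporal neighborhood of each step.
        pad = self.kernel_size // 2
        xp = F.pad(x.transpose(1, 2), (pad, pad))          # (B, D, T + 2*pad)
        windows = xp.unfold(-1, self.kernel_size, 1)       # (B, D, T, K)

        # Weighted local aggregation: imposes local consistency with
        # input-conditioned weights rather than fixed convolution kernels.
        return torch.einsum('bdtk,btk->btd', windows, w)   # (B, T, D)

if __name__ == "__main__":
    layer = DynamicLocalPerception(dim=256)
    a, v = torch.randn(2, 60, 256), torch.randn(2, 60, 256)
    print(layer(a, v).shape)  # torch.Size([2, 60, 256])
```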