Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman
arXiv:2406.05629 (arXiv - CS - Sound), 2024-06-09
Abstract
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn "global" audio and video representations cannot localize words and sound. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech- and sound-prompted semantic segmentation. On these and other datasets we show DenseAV dramatically outperforms the prior art on speech- and sound-prompted semantic segmentation. DenseAV outperforms the previous state of the art, ImageBind, on cross-modal retrieval using fewer than half the parameters. Project page: https://aka.ms/denseav
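To make the core idea concrete, the sketch below illustrates one way a multi-head aggregation operator could compare dense audio and image features and reduce them to a clip-level score for contrastive training. This is a minimal, hypothetical PyTorch sketch, not the paper's actual implementation: the function names (dense_av_similarity, info_nce), the number of heads, and the max-over-patches / mean-over-time / sum-over-heads aggregation are illustrative assumptions; DenseAV's real operator and loss may differ in detail.

```python
import torch
import torch.nn.functional as F

def dense_av_similarity(audio_feats, visual_feats, n_heads=2):
    """Hypothetical multi-head dense audio-visual similarity.

    audio_feats:  [B, T, C]     per-timestep audio features
    visual_feats: [B, H, W, C]  per-patch visual features
    Returns a [B, B] clip-level similarity matrix for contrastive learning.
    """
    B, T, C = audio_feats.shape
    _, H, W, _ = visual_feats.shape
    assert C % n_heads == 0, "feature dim must split evenly across heads"
    d = C // n_heads

    # Split channels into heads so each head can specialize
    # (e.g., one attending to spoken words, another to ambient sounds).
    a = F.normalize(audio_feats.reshape(B, T, n_heads, d), dim=-1)
    v = F.normalize(visual_feats.reshape(B, H * W, n_heads, d), dim=-1)

    # Dense similarity volume: every audio timestep of clip i against
    # every image patch of clip j, per head -> [B, B, heads, T, H*W].
    sim = torch.einsum('ithd,jphd->ijhtp', a, v)

    # Aggregate the volume into one score per clip pair:
    # max over patches ("where"), mean over time, sum of head contributions.
    sim = sim.max(dim=-1).values.mean(dim=-1).sum(dim=-1)  # [B, B]
    return sim

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE loss over the clip-level similarity matrix."""
    labels = torch.arange(sim.size(0), device=sim.device)
    logits = sim / temperature
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

Because the aggregation keeps the dense similarity volume around before pooling, localization falls out for free at inference time: for a given word or sound, the per-patch similarities themselves form a heatmap, which is the intuition behind the abstract's claim that "global"-only representations cannot localize.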