SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman
arXiv:2404.05206 (arXiv - CS - Sound), published 2024-04-08
Abstract
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when any one pair does not. We show our approach can successfully discover how the long tail of human actions sound from egocentric video, outperforming an array of recent multimodal embedding techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal tasks.
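The abstract describes the MC3 objective only in words: pairwise audio-language-vision associations are strengthened when all three modality pairs agree and weakened otherwise. Below is a minimal PyTorch sketch of one way a consensus-gated multimodal contrastive loss could be written. It is not the authors' implementation; the specific gating rule (minimum pairwise cosine similarity) and all names such as `consensus_gate` and `mc3_style_loss` are illustrative assumptions.

```python
# Hypothetical sketch of a consensus-gated multimodal contrastive loss,
# loosely inspired by the MC3 idea described in the abstract above.
# NOT the paper's implementation; the gating rule is an assumption.
import torch
import torch.nn.functional as F


def info_nce_per_sample(x, y, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings,
    returned per sample so it can be re-weighted afterwards."""
    logits = x @ y.t() / temperature                       # (B, B) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)     # positives on the diagonal
    loss_xy = F.cross_entropy(logits, targets, reduction="none")
    loss_yx = F.cross_entropy(logits.t(), targets, reduction="none")
    return 0.5 * (loss_xy + loss_yx)                        # (B,)


def consensus_gate(a, v, t):
    """Per-sample agreement score: the weakest cosine similarity among the
    three modality pairs. It is high only when audio, vision, and language
    all agree, and low when any one pair disagrees."""
    sims = torch.stack([(a * v).sum(-1),
                        (a * t).sum(-1),
                        (v * t).sum(-1)], dim=-1)           # (B, 3)
    return sims.min(dim=-1).values.clamp(min=0.0)           # (B,) in [0, 1]


def mc3_style_loss(a, v, t, temperature=0.07):
    """Weight each sample's pairwise contrastive terms by its consensus gate,
    so associations are reinforced when all pairs agree and diminished otherwise."""
    a = F.normalize(a, dim=-1)
    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    gate = consensus_gate(a, v, t)                          # (B,)
    pair_loss = (info_nce_per_sample(a, v, temperature)
                 + info_nce_per_sample(a, t, temperature)
                 + info_nce_per_sample(v, t, temperature))  # (B,)
    return (gate * pair_loss).mean()


if __name__ == "__main__":
    B, D = 8, 256
    audio, video, text = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    print(mc3_style_loss(audio, video, text).item())
```

The per-sample gating is the key design choice in this sketch: rather than trusting every narrated clip to have aligned sound, vision, and language, the loss lets samples where any modality pair disagrees contribute less, which mirrors the agree/disagree behavior the abstract attributes to MC3.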