Learning Spatially-Aware Language and Audio Embedding
Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, Miguel Sarabia
arXiv:2409.11369 · 2024-09-17
{"title":"Learning Spatially-Aware Language and Audio Embedding","authors":"Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, Miguel Sarabia","doi":"arxiv-2409.11369","DOIUrl":null,"url":null,"abstract":"Humans can picture a sound scene given an imprecise natural language\ndescription. For example, it is easy to imagine an acoustic environment given a\nphrase like \"the lion roar came from right behind me!\". For a machine to have\nthe same degree of comprehension, the machine must know what a lion is\n(semantic attribute), what the concept of \"behind\" is (spatial attribute) and\nhow these pieces of linguistic information align with the semantic and spatial\nattributes of the sound (what a roar sounds like when its coming from behind).\nState-of-the-art audio foundation models which learn to map between audio\nscenes and natural textual descriptions, are trained on non-spatial audio and\ntext pairs, and hence lack spatial awareness. In contrast, sound event\nlocalization and detection models are limited to recognizing sounds from a\nfixed number of classes, and they localize the source to absolute position\n(e.g., 0.2m) rather than a position described using natural language (e.g.,\n\"next to me\"). To address these gaps, we present ELSA a spatially aware-audio\nand text embedding model trained using multimodal contrastive learning. ELSA\nsupports non-spatial audio, spatial audio, and open vocabulary text captions\ndescribing both the spatial and semantic components of sound. To train ELSA:\n(a) we spatially augment the audio and captions of three open-source audio\ndatasets totaling 4,738 hours of audio, and (b) we design an encoder to capture\nthe semantics of non-spatial audio, and the semantics and spatial attributes of\nspatial audio using contrastive learning. ELSA is competitive with\nstate-of-the-art for both semantic retrieval and 3D source localization. In\nparticular, ELSA achieves +2.8% mean audio-to-text and text-to-audio R@1 above\nthe baseline, and outperforms by -11.6{\\deg} mean-absolute-error in 3D source\nlocalization over the baseline.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11369","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Humans can picture a sound scene given an imprecise natural language
description. For example, it is easy to imagine an acoustic environment given a
phrase like "the lion roar came from right behind me!". For a machine to have
the same degree of comprehension, the machine must know what a lion is
(semantic attribute), what the concept of "behind" is (spatial attribute) and
how these pieces of linguistic information align with the semantic and spatial
attributes of the sound (what a roar sounds like when it's coming from behind).
State-of-the-art audio foundation models, which learn to map between audio
scenes and natural textual descriptions, are trained on non-spatial audio and
text pairs and hence lack spatial awareness. In contrast, sound event
localization and detection models are limited to recognizing sounds from a
fixed number of classes, and they localize the source to an absolute position
(e.g., 0.2 m) rather than a position described using natural language (e.g.,
"next to me"). To address these gaps, we present ELSA, a spatially aware audio
and text embedding model trained using multimodal contrastive learning. ELSA
supports non-spatial audio, spatial audio, and open-vocabulary text captions
describing both the spatial and semantic components of sound. To train ELSA:
(a) we spatially augment the audio and captions of three open-source audio
datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture
the semantics of non-spatial audio, and the semantics and spatial attributes of
spatial audio using contrastive learning. ELSA is competitive with
state-of-the-art for both semantic retrieval and 3D source localization. In
particular, ELSA achieves +2.8% mean audio-to-text and text-to-audio R@1 above
the baseline, and reduces the mean absolute error in 3D source localization by
11.6° relative to the baseline.
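
The abstract describes training with multimodal contrastive learning over paired (spatial) audio and captions. As a rough, hypothetical illustration of that kind of objective, and not the authors' implementation, the sketch below shows a standard CLIP-style symmetric InfoNCE loss between audio and text embeddings; the function name, tensor shapes, and temperature value are assumptions.

```python
# Hypothetical sketch of a CLIP-style symmetric contrastive objective between
# audio and caption embeddings; illustrative only, not ELSA's actual code.
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(audio_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb: (B, D) embeddings of (spatial or non-spatial) audio clips.
    text_emb:  (B, D) embeddings of the matching captions, which may describe
               both semantics ("a lion roars") and space ("from behind me").
    """
    # L2-normalize so dot products become cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = audio_emb @ text_emb.t() / temperature  # (B, B)

    # The i-th audio clip is the positive match for the i-th caption.
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Cross-entropy in both retrieval directions (audio->text, text->audio).
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)
```

Under this framing, audio-to-text and text-to-audio R@1 simply measure how often the correct caption (or clip) is the nearest neighbor in the shared embedding space, which is the retrieval metric the abstract reports.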