Learning Spatially-Aware Language and Audio Embedding
Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, Miguel Sarabia
arXiv:2409.11369 · 2024-09-17
{"title":"Learning Spatially-Aware Language and Audio Embedding","authors":"Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, Miguel Sarabia","doi":"arxiv-2409.11369","DOIUrl":null,"url":null,"abstract":"Humans can picture a sound scene given an imprecise natural language\ndescription. For example, it is easy to imagine an acoustic environment given a\nphrase like \"the lion roar came from right behind me!\". For a machine to have\nthe same degree of comprehension, the machine must know what a lion is\n(semantic attribute), what the concept of \"behind\" is (spatial attribute) and\nhow these pieces of linguistic information align with the semantic and spatial\nattributes of the sound (what a roar sounds like when its coming from behind).\nState-of-the-art audio foundation models which learn to map between audio\nscenes and natural textual descriptions, are trained on non-spatial audio and\ntext pairs, and hence lack spatial awareness. In contrast, sound event\nlocalization and detection models are limited to recognizing sounds from a\nfixed number of classes, and they localize the source to absolute position\n(e.g., 0.2m) rather than a position described using natural language (e.g.,\n\"next to me\"). To address these gaps, we present ELSA a spatially aware-audio\nand text embedding model trained using multimodal contrastive learning. ELSA\nsupports non-spatial audio, spatial audio, and open vocabulary text captions\ndescribing both the spatial and semantic components of sound. To train ELSA:\n(a) we spatially augment the audio and captions of three open-source audio\ndatasets totaling 4,738 hours of audio, and (b) we design an encoder to capture\nthe semantics of non-spatial audio, and the semantics and spatial attributes of\nspatial audio using contrastive learning. ELSA is competitive with\nstate-of-the-art for both semantic retrieval and 3D source localization. In\nparticular, ELSA achieves +2.8% mean audio-to-text and text-to-audio R@1 above\nthe baseline, and outperforms by -11.6{\\deg} mean-absolute-error in 3D source\nlocalization over the baseline.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11369","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Humans can picture a sound scene given an imprecise natural language
description. For example, it is easy to imagine an acoustic environment given a
phrase like "the lion roar came from right behind me!". For a machine to have
the same degree of comprehension, the machine must know what a lion is
(semantic attribute), what the concept of "behind" is (spatial attribute) and
how these pieces of linguistic information align with the semantic and spatial
attributes of the sound (what a roar sounds like when it's coming from behind).
State-of-the-art audio foundation models, which learn to map between audio
scenes and natural textual descriptions, are trained on non-spatial audio and
text pairs and hence lack spatial awareness. In contrast, sound event
localization and detection models are limited to recognizing sounds from a
fixed number of classes, and they localize the source to an absolute position
(e.g., 0.2 m) rather than a position described using natural language (e.g.,
"next to me"). To address these gaps, we present ELSA, a spatially aware audio
and text embedding model trained using multimodal contrastive learning. ELSA
supports non-spatial audio, spatial audio, and open-vocabulary text captions
describing both the spatial and semantic components of sound. To train ELSA:
(a) we spatially augment the audio and captions of three open-source audio
datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture
the semantics of non-spatial audio, and the semantics and spatial attributes of
spatial audio using contrastive learning. ELSA is competitive with
state-of-the-art for both semantic retrieval and 3D source localization. In
particular, ELSA achieves +2.8% mean audio-to-text and text-to-audio R@1 above
the baseline, and reduces the mean absolute error in 3D source localization by
11.6° relative to the baseline.
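
The abstract describes training with multimodal contrastive learning over paired (spatial) audio and captions. As a rough, hypothetical illustration of that kind of objective, and not the authors' implementation, the sketch below shows a standard CLIP-style symmetric InfoNCE loss between audio and text embeddings; the function name, tensor shapes, and temperature value are assumptions.

```python
# Hypothetical sketch of a CLIP-style symmetric contrastive objective between
# audio and caption embeddings; illustrative only, not ELSA's actual code.
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(audio_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb: (B, D) embeddings of (spatial or non-spatial) audio clips.
    text_emb:  (B, D) embeddings of the matching captions, which may describe
               both semantics ("a lion roars") and space ("from behind me").
    """
    # L2-normalize so dot products become cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = audio_emb @ text_emb.t() / temperature  # (B, B)

    # The i-th audio clip is the positive match for the i-th caption.
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Cross-entropy in both retrieval directions (audio->text, text->audio).
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)
```

Under this framing, audio-to-text and text-to-audio R@1 simply measure how often the correct caption (or clip) is the nearest neighbor in the shared embedding space, which is the retrieval metric the abstract reports.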