SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms

Interspeech, Pub Date: 2022-09-18, DOI: 10.21437/interspeech.2022-378

Yuhang He, A. Markham

Pages: 2408-2412

Citations: 3

Abstract

A fundamental task for an agent seeking to understand its environment acoustically is to detect each sound source's location (for example, its direction of arrival (DoA)) and semantic label. The task is challenging for three reasons: first, sound sources overlap in time, frequency, and space; second, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in inter-channel phase differences; third, although the microphone sensors are few in number, the recorded waveform is temporally dense because of the high sampling rate. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic features such as GCC-PHAT and Mel-spectrograms, so as to benefit from the success of mature 2D image-based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA, that learns sound source DoA and semantics directly from raw sound waveforms. We first use a learnable front-end filter bank to dynamically encode features relevant to sound source semantics and DoA into a compact representation. A backbone network consisting of two identical sub-networks with a layerwise communication strategy then learns the semantic label and DoA both separately and jointly. Finally, a permutation-invariant multi-track head regresses DoA and classifies the semantic label. Extensive experiments on the DCASE 2020 sound event localization and detection (SELD) dataset demonstrate the superiority of SoundDoA over existing methods.
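
The abstract does not say how the permutation-invariant multi-track head is scored, so the following is only a minimal sketch of the general idea, not the authors' loss. It assumes each track outputs a 3D direction vector and that the error for a frame is the mean angular error under the best assignment of predicted tracks to ground-truth sources. The function names `angular_error_deg` and `permutation_invariant_error` are hypothetical and chosen for illustration.

```python
import itertools
import math


def angular_error_deg(u, v):
    """Angle in degrees between two 3D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))


def permutation_invariant_error(pred_doas, true_doas):
    """Return (mean angular error, best permutation) over all track orderings.

    A permutation p means pred_doas[p[i]] is matched to true_doas[i].
    Assumption: both lists have the same length (one prediction per source).
    """
    n = len(true_doas)
    best_err, best_perm = float("inf"), None
    for perm in itertools.permutations(range(n)):
        err = sum(
            angular_error_deg(pred_doas[perm[i]], true_doas[i]) for i in range(n)
        ) / n
        if err < best_err:
            best_err, best_perm = err, perm
    return best_err, best_perm


# Two sources; the predictions are correct but the tracks are swapped.
true_doas = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred_doas = [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0)]
err, perm = permutation_invariant_error(pred_doas, true_doas)
print(err, perm)
```

Because the minimum is taken over all track orderings, a model is not penalized for emitting overlapping sources in a different track order than the labels. Exhaustive search costs n! evaluations, which is acceptable only for the small number of tracks typical of overlapping-source settings.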

