SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms

Yuhang He, A. Markham
{"title":"SoundDoA:从声音原始波形中学习声源到达方向和语义","authors":"Yuhang He, A. Markham","doi":"10.21437/interspeech.2022-378","DOIUrl":null,"url":null,"abstract":"A fundamental task for an agent to understand an environment acoustically is to detect sound source location (like direction of arrival (DoA)) and semantic label. It is a challenging task: firstly, sound sources overlap in time, frequency and space; secondly, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in inter-channel phase difference; lastly, although the number of microphone sensors are sparse, recorded sound waveform is temporally dense due to the high sampling rates. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic feature such as GCC-PHAT and Mel-spectrograms so as to benefit from the success of mature 2D image based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA , that is capable of learning sound source DoA and semantics directly from sound raw waveforms. We first use a learnable front-end filter bank to dynamically encode sound source semantics and DoA relevant features into a compact representation. A backbone network consisting of two identical sub-networks with layerwise communication strategy is then proposed to further learn semantic label and DoA both separately and jointly. Finally, a permutation invariant multi-track head is added to regress DoA and classify semantic label. Extensive experimental results on DCASE 2020 sound event detection and localization dataset (SELD) demonstrate the superiority of SoundDoA , when comparing with other existing methods.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2408-2412"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms\",\"authors\":\"Yuhang He, A. Markham\",\"doi\":\"10.21437/interspeech.2022-378\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A fundamental task for an agent to understand an environment acoustically is to detect sound source location (like direction of arrival (DoA)) and semantic label. It is a challenging task: firstly, sound sources overlap in time, frequency and space; secondly, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in inter-channel phase difference; lastly, although the number of microphone sensors are sparse, recorded sound waveform is temporally dense due to the high sampling rates. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic feature such as GCC-PHAT and Mel-spectrograms so as to benefit from the success of mature 2D image based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA , that is capable of learning sound source DoA and semantics directly from sound raw waveforms. We first use a learnable front-end filter bank to dynamically encode sound source semantics and DoA relevant features into a compact representation. A backbone network consisting of two identical sub-networks with layerwise communication strategy is then proposed to further learn semantic label and DoA both separately and jointly. Finally, a permutation invariant multi-track head is added to regress DoA and classify semantic label. 
Extensive experimental results on DCASE 2020 sound event detection and localization dataset (SELD) demonstrate the superiority of SoundDoA , when comparing with other existing methods.\",\"PeriodicalId\":73500,\"journal\":{\"name\":\"Interspeech\",\"volume\":\"1 1\",\"pages\":\"2408-2412\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Interspeech\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/interspeech.2022-378\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interspeech","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/interspeech.2022-378","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

A fundamental task for an agent seeking to understand an environment acoustically is to detect each sound source's location (e.g., its direction of arrival (DoA)) and its semantic label. The task is challenging: first, sound sources overlap in time, frequency and space; second, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in inter-channel phase differences; finally, although the microphone sensors are spatially sparse, the recorded waveform is temporally dense due to high sampling rates. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic features, such as GCC-PHAT and Mel-spectrograms, so as to benefit from mature 2D image-based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA, that learns sound source DoA and semantics directly from raw sound waveforms. We first use a learnable front-end filter bank to dynamically encode sound source semantics and DoA-relevant features into a compact representation. A backbone network consisting of two identical sub-networks with a layerwise communication strategy then learns the semantic label and DoA both separately and jointly. Finally, a permutation-invariant multi-track head regresses DoA and classifies the semantic label. Extensive experiments on the DCASE 2020 sound event localization and detection (SELD) dataset demonstrate the superiority of SoundDoA over existing methods.
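As context for the pre-extracted features the abstract contrasts against, the sketch below computes a classical GCC-PHAT cross-correlation between two microphone channels in NumPy; the peak lag estimates the inter-channel time difference that DoA estimation builds on. This is a textbook illustration of the baseline feature, not code from the paper; the function name, the 24 kHz sampling rate and the synthetic 50-sample delay are chosen purely for the example.

```python
import numpy as np

def gcc_phat(sig, refsig, fs, max_tau=None, n_fft=None):
    # Classical GCC-PHAT: the cross-power spectrum is whitened to unit
    # magnitude, so only inter-channel phase (hence delay) information remains.
    n = len(sig) + len(refsig)
    if n_fft is None:
        n_fft = int(2 ** np.ceil(np.log2(n)))
    X = np.fft.rfft(sig, n=n_fft)
    Y = np.fft.rfft(refsig, n=n_fft)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12                     # PHAT weighting
    cc = np.fft.irfft(R, n=n_fft)
    max_shift = n_fft // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so lag 0 sits in the middle; peak lag = delay of sig vs. refsig.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lags = np.arange(-max_shift, max_shift + 1) / fs
    return lags, cc

# Usage: recover a known 50-sample delay between two noise channels.
fs = 24000
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)                    # reference channel
y = np.roll(x, 50)                             # same signal, delayed 50 samples
lags, cc = gcc_phat(y, x, fs)
print(lags[np.argmax(np.abs(cc))] * fs)        # -> 50.0
```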
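The learnable front-end filter bank can be pictured as a strided 1-D convolution bank applied to the raw multichannel waveform in place of a fixed Mel or GCC-PHAT transform. The minimal PyTorch sketch below is only a stand-in under that reading: the class name, filter count, kernel size and hop are invented for illustration and do not reproduce SoundDoA's actual front end.

```python
import torch
import torch.nn as nn

class LearnableFrontEnd(nn.Module):
    """Hypothetical learnable filter-bank front end (illustration only).

    Maps a raw waveform (batch, mics, samples) to a learned feature map
    (batch, mics * n_filters, frames) in place of fixed 2D features.
    """
    def __init__(self, n_mics=4, n_filters=64, kernel=401, hop=240):
        super().__init__()
        # groups=n_mics gives every microphone its own filters and keeps
        # channels unmixed, leaving cross-channel comparison to later layers.
        self.bank = nn.Conv1d(n_mics, n_mics * n_filters, kernel,
                              stride=hop, padding=kernel // 2, groups=n_mics)

    def forward(self, wav):
        return torch.log1p(self.bank(wav).abs())   # compressed magnitude

front = LearnableFrontEnd()
feats = front(torch.randn(2, 4, 24000))            # one second at 24 kHz
print(feats.shape)                                 # torch.Size([2, 256, 100])
```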
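A permutation-invariant multi-track head implies a training loss that scores every assignment of predicted tracks to ground-truth sources and keeps the cheapest one, so the network may emit sources in any track order. The PyTorch sketch below shows that idea in its generic form (standard permutation-invariant training); the tensor shapes, the unit-vector DoA encoding, and the weighting term alpha are assumptions rather than the paper's exact loss.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_seld_loss(doa_pred, doa_true, cls_logits, cls_true, alpha=1.0):
    """Permutation-invariant loss over output tracks (generic PIT sketch).

    doa_pred, doa_true: (B, T, tracks, 3) Cartesian unit-vector DoAs.
    cls_logits: (B, T, tracks, n_classes); cls_true: (B, T, tracks) ints.
    """
    n_tracks = doa_pred.shape[2]
    per_perm = []
    for perm in itertools.permutations(range(n_tracks)):
        p = list(perm)
        # DoA regression error under this track-to-source assignment.
        doa_l = F.mse_loss(doa_pred[:, :, p], doa_true,
                           reduction="none").mean(dim=(1, 2, 3))
        # Classification error under the same assignment.
        cls_l = F.cross_entropy(cls_logits[:, :, p].flatten(0, 2),
                                cls_true.flatten(),
                                reduction="none").view(cls_true.shape).mean(dim=(1, 2))
        per_perm.append(doa_l + alpha * cls_l)     # (B,) cost per example
    # Keep the cheapest assignment for each example, then average.
    return torch.stack(per_perm).min(dim=0).values.mean()

# Usage with random tensors (2 tracks, 13 classes).
B, T, K, C = 2, 10, 2, 13
doa_pred = torch.randn(B, T, K, 3, requires_grad=True)
loss = pit_seld_loss(doa_pred,
                     F.normalize(torch.randn(B, T, K, 3), dim=-1),
                     torch.randn(B, T, K, C),
                     torch.randint(0, C, (B, T, K)))
loss.backward()
print(loss.item())
```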