GLFER-Net: a polyphonic sound source localization and detection network based on global-local feature extraction and recalibration

IF 1.9 3区计算机科学 Q2 ACOUSTICS

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-06-26 DOI:10.1186/s13636-024-00356-4

Mengzhen Ma, Ying Hu, Liang He, Hao Huang

{"title":"GLFER-Net: a polyphonic sound source localization and detection network based on global-local feature extraction and recalibration","authors":"Mengzhen Ma, Ying Hu, Liang He, Hao Huang","doi":"10.1186/s13636-024-00356-4","DOIUrl":null,"url":null,"abstract":"Polyphonic sound source localization and detection (SSLD) task aims to recognize the categories of sound events, identify their onset and offset times, and detect their corresponding direction-of-arrival (DOA), where polyphonic refers to the occurrence of multiple overlapping sound sources in a segment. However, vanilla SSLD methods based on convolutional recurrent neural network (CRNN) suffer from insufficient feature extraction. The convolutions with kernel of single scale in CRNN fail to adequately extract multi-scale features of sound events, which have diverse time-frequency characteristics. It results in that the extracted features lack fine-grained information helpful for the localization of sound sources. In response to these challenges, we propose a polyphonic SSLD network based on global-local feature extraction and recalibration (GLFER-Net), where the global-local feature (GLF) extractor is designed to extract the multi-scale global features through an omni-directional dynamic convolution (ODConv) layer and multi-scale feature extraction (MSFE) module. The local feature extraction (LFE) unit is designed for capturing detailed information. Besides, we design a feature recalibration (FR) module to emphasize the crucial features along multiple dimensions. On the open datasets of Task3 in DCASE 2021 and 2022 Challenges, we compared our proposed GLFER-Net with six and four SSLD methods, respectively. The results show that the GLFER-Net achieves competitive performance. The modules we designed are verified to be effective through a series of ablation experiments and visualization analyses.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"94 1","pages":""},"PeriodicalIF":1.9000,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Eurasip Journal on Audio Speech and Music Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1186/s13636-024-00356-4","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

Abstract

Polyphonic sound source localization and detection (SSLD) task aims to recognize the categories of sound events, identify their onset and offset times, and detect their corresponding direction-of-arrival (DOA), where polyphonic refers to the occurrence of multiple overlapping sound sources in a segment. However, vanilla SSLD methods based on convolutional recurrent neural network (CRNN) suffer from insufficient feature extraction. The convolutions with kernel of single scale in CRNN fail to adequately extract multi-scale features of sound events, which have diverse time-frequency characteristics. It results in that the extracted features lack fine-grained information helpful for the localization of sound sources. In response to these challenges, we propose a polyphonic SSLD network based on global-local feature extraction and recalibration (GLFER-Net), where the global-local feature (GLF) extractor is designed to extract the multi-scale global features through an omni-directional dynamic convolution (ODConv) layer and multi-scale feature extraction (MSFE) module. The local feature extraction (LFE) unit is designed for capturing detailed information. Besides, we design a feature recalibration (FR) module to emphasize the crucial features along multiple dimensions. On the open datasets of Task3 in DCASE 2021 and 2022 Challenges, we compared our proposed GLFER-Net with six and four SSLD methods, respectively. The results show that the GLFER-Net achieves competitive performance. The modules we designed are verified to be effective through a series of ablation experiments and visualization analyses.

查看原文本刊更多论文

GLFER-Net：基于全局-局部特征提取和重新校准的复调声源定位和检测网络

复调声源定位和检测（SSLD）任务旨在识别声音事件的类别、确定其起始和偏移时间，并检测其相应的到达方向（DOA），其中复调指的是在一个片段中出现多个重叠声源。然而，基于卷积递归神经网络（CRNN）的传统 SSLD 方法存在特征提取不足的问题。CRNN 中的卷积核为单尺度，无法充分提取具有不同时频特征的声音事件的多尺度特征。这导致提取的特征缺乏有助于声源定位的细粒度信息。为了应对这些挑战，我们提出了一种基于全局本地特征提取和重新校准的多声道 SSLD 网络（GLFER-Net），其中全局本地特征提取器（GLF）通过全向动态卷积（ODConv）层和多尺度特征提取（MSFE）模块提取多尺度全局特征。局部特征提取（LFE）单元用于捕捉细节信息。此外，我们还设计了一个特征重新校准（FR）模块，以强调多个维度上的关键特征。在 DCASE 2021 年和 2022 年挑战赛任务 3 的开放数据集上，我们将所提出的 GLFER-Net 分别与六种和四种 SSLD 方法进行了比较。结果表明，GLFER-Net 的性能极具竞争力。通过一系列消融实验和可视化分析，我们验证了所设计模块的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Eurasip Journal on Audio Speech and Music Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

4.10

自引率

4.20%

发文量

审稿时长

12 months

期刊介绍： The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.