Multi-channel Environmental Sound Segmentation utilizing Sound Source Localization and Separation U-Net

Yui Sudo, Katsutoshi Itoyama, Kenji Nishida, K. Nakadai
{"title":"Multi-channel Environmental Sound Segmentation utilizing Sound Source Localization and Separation U-Net","authors":"Yui Sudo, Katsutoshi Itoyama, Kenji Nishida, K. Nakadai","doi":"10.1109/IEEECONF49454.2021.9382730","DOIUrl":null,"url":null,"abstract":"This paper proposes a multi-channel environmental sound segmentation method. Environmental sound segmentation is an integrated method that deals with sound source localization, sound source separation and class identification. When multiple microphones are available, spatial features can be used to improve the separation accuracy of signals from different directions; however, conventional methods have two drawbacks: (a) Since sound source localization and sound source separation using spatial features and class identification using spectral features are trained in the same neural network, it overfits to the relationship between the direction of arrival and the class. (b) Although the permutation invariant training used in speech recognition could be extended, it is not practical for environmental sounds due to the maximum number of speakers limitation. This paper proposes multi-channel environmental sound segmentation method that combines U-Net which simultaneously performs sound source localization and sound source separation, and convolutional neural network which classifies the separated sounds. This method prevents overfitting to the relationship between the direction of arrival and the class. Simulation experiments using the created datasets including 75-class environmental sounds showed that the root mean squared error of the proposed method was lower than that of the conventional method.","PeriodicalId":395378,"journal":{"name":"2021 IEEE/SICE International Symposium on System Integration (SII)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/SICE International Symposium on System Integration (SII)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IEEECONF49454.2021.9382730","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

This paper proposes a multi-channel environmental sound segmentation method. Environmental sound segmentation is an integrated task that combines sound source localization, sound source separation, and class identification. When multiple microphones are available, spatial features can be used to improve the separation accuracy of signals arriving from different directions; however, conventional methods have two drawbacks: (a) because sound source localization and sound source separation using spatial features and class identification using spectral features are trained in a single neural network, the network overfits to the relationship between the direction of arrival and the class; (b) although the permutation invariant training used in speech recognition could be extended, it is not practical for environmental sounds because of the limitation on the maximum number of speakers. This paper proposes a multi-channel environmental sound segmentation method that combines a U-Net, which simultaneously performs sound source localization and sound source separation, with a convolutional neural network, which classifies the separated sounds. This design prevents overfitting to the relationship between the direction of arrival and the class. Simulation experiments on datasets created from 75 classes of environmental sounds showed that the root mean squared error of the proposed method was lower than that of the conventional method.
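The abstract describes the architecture only at a high level. The PyTorch sketch below illustrates the two-stage design it outlines: a separation U-Net that maps multi-channel spectral features to per-source masks, followed by an independent CNN that classifies each separated sound. The 8-channel input, mask-based separation, fixed number of separated sources, and all layer widths are illustrative assumptions, not the paper's actual configuration; only the 75-class output matches the paper, and the localization output of the actual U-Net is omitted here.

```python
# Minimal sketch of the two-stage pipeline: separation U-Net, then a CNN
# classifier. Layer sizes, input channels, and num_sources are assumptions.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions, each with batch norm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class SeparationUNet(nn.Module):
    """U-Net mapping multi-channel features to per-source spectrogram masks."""
    def __init__(self, in_ch=8, num_sources=4):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.bottleneck = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, num_sources, 1)  # one magnitude mask per source

    def forward(self, x):                       # x: (B, in_ch, F, T)
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return torch.sigmoid(self.head(d1))     # masks in [0, 1], (B, num_sources, F, T)

class SoundClassifier(nn.Module):
    """CNN labeling each separated spectrogram with one of 75 classes (per the paper)."""
    def __init__(self, num_classes=75):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 32), nn.MaxPool2d(2),
            conv_block(32, 64), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, num_classes)

    def forward(self, spec):                    # spec: (B, 1, F, T)
        return self.fc(self.features(spec).flatten(1))

# Stage 1 separates; stage 2 classifies each separated source independently,
# so the class decision never sees spatial features or the direction of arrival.
unet, clf = SeparationUNet(), SoundClassifier()
feats = torch.randn(2, 8, 64, 64)               # batch of multi-channel features
masks = unet(feats)                             # (2, 4, 64, 64)
mixture_mag = feats[:, :1]                      # assume channel 0 is the mixture magnitude
separated = masks * mixture_mag                 # masked per-source spectrograms
logits = clf(separated.reshape(-1, 1, 64, 64))  # (2 * 4, 75)
```

Because the classifier receives only a separated single-channel spectrogram, the class decision is decoupled from the direction of arrival, which is the abstract's stated mechanism for avoiding drawback (a). The design also sidesteps permutation invariant training, whose loss must be evaluated under every source-to-target assignment and therefore scales factorially with the number of simultaneous sources.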