Speech Localization at Low Bitrates in Wireless Acoustics Sensor Networks

IF 1.3 Q3 ENGINEERING, ELECTRICAL & ELECTRONIC
Mariem Bouafif Mansali, Pablo Pérez Zarazaga, Tom Bäckström, Z. Lachiri
{"title":"Speech Localization at Low Bitrates in Wireless Acoustics Sensor Networks","authors":"Mariem Bouafif Mansali, Pablo Pérez Zarazaga, Tom Bäckström, Z. Lachiri","doi":"10.3389/frsip.2022.800003","DOIUrl":null,"url":null,"abstract":"The use of speech source localization (SSL) and its applications offer great possibilities for the design of speaker local positioning systems with wireless acoustic sensor networks (WASNs). Recent works have shown that data-driven front-ends can outperform traditional algorithms for SSL when trained to work in specific domains, depending on factors like reverberation and noise levels. However, such localization models consider localization directly from raw sensor observations, without consideration for transmission losses in WASNs. In contrast, when sensors reside in separate real-life devices, we need to quantize, encode and transmit sensor data, decreasing the performance of localization, especially when the transmission bitrate is low. In this work, we investigate the effect of low bitrate transmission on a Direction of Arrival (DoA) estimator. We analyze a deep neural network (DNN) based framework performance as a function of the audio encoding bitrate for compressed signals by employing recent communication codecs including PyAWNeS, Opus, EVS, and Lyra. Experimental results show that training the DNN on input encoded with the PyAWNeS codec at 16.4 kB/s can improve the accuracy significantly, and up to 50% of accuracy degradation at a low bitrate for almost all codecs can be recovered. Our results further show that for the best accuracy of the trained model when one of the two channels can be encoded with a bitrate higher than 32 kB/s, it is optimal to have the raw data for the second channel. However, for a lower bitrate, it is preferable to similarly encode the two channels. More importantly, for practical applications, a more generalized model trained with a randomly selected codec for each channel, shows a large accuracy gain when at least one of the two channels is encoded with PyAWNeS.","PeriodicalId":93557,"journal":{"name":"Frontiers in signal processing","volume":"31 1","pages":""},"PeriodicalIF":1.3000,"publicationDate":"2022-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in signal processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/frsip.2022.800003","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Speech source localization (SSL) and its applications offer great possibilities for the design of speaker positioning systems based on wireless acoustic sensor networks (WASNs). Recent work has shown that data-driven front-ends can outperform traditional SSL algorithms when trained for specific domains, depending on factors such as reverberation and noise levels. However, such localization models operate directly on raw sensor observations, without accounting for transmission losses in WASNs. In practice, when the sensors reside in separate devices, the sensor data must be quantized, encoded, and transmitted, which degrades localization performance, especially when the transmission bitrate is low. In this work, we investigate the effect of low-bitrate transmission on a direction-of-arrival (DoA) estimator. We analyze the performance of a deep neural network (DNN) based framework as a function of the audio encoding bitrate of the compressed signals, using recent communication codecs including PyAWNeS, Opus, EVS, and Lyra. Experimental results show that training the DNN on input encoded with the PyAWNeS codec at 16.4 kbit/s improves accuracy significantly, recovering up to 50% of the accuracy lost at low bitrates for almost all codecs. Our results further show that when one of the two channels can be encoded at a bitrate above 32 kbit/s, the trained model is most accurate if the second channel is left as raw data, whereas at lower bitrates it is preferable to encode both channels similarly. More importantly, for practical applications, a more general model trained with a randomly selected codec for each channel shows a large accuracy gain when at least one of the two channels is encoded with PyAWNeS.
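The last result above describes training with a randomly selected codec per channel. A minimal sketch of such an augmentation step is given below, assuming a two-channel input and a hypothetical encode_decode helper standing in for the real PyAWNeS/Opus/EVS/Lyra round trip; the codec list and bitrate grid are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical stand-in for a codec round trip (encode + decode) of one channel.
# A real pipeline would call the actual PyAWNeS / Opus / EVS / Lyra encoder and
# decoder here; this stub passes the signal through so the sketch stays runnable.
def encode_decode(signal: np.ndarray, codec: str, bitrate_kbps: float) -> np.ndarray:
    return signal.copy()

# Assumed codec/bitrate grid (kbit/s); "raw" leaves a channel uncoded.
CODEC_CHOICES = [
    ("pyawnes", 16.4), ("pyawnes", 32.0),
    ("opus", 16.0), ("opus", 32.0),
    ("evs", 16.4), ("evs", 32.0),
    ("lyra", 3.0),
    ("raw", None),
]

def augment_two_channel(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Encode each channel of a (2, n_samples) signal with a randomly chosen
    codec and bitrate, mimicking low-bitrate transmission over a WASN link."""
    out = np.empty_like(x)
    for ch in range(x.shape[0]):
        codec, bitrate = CODEC_CHOICES[rng.integers(len(CODEC_CHOICES))]
        if codec == "raw":
            out[ch] = x[ch]          # second channel kept as raw sensor data
        else:
            out[ch] = encode_decode(x[ch], codec, bitrate)
    return out

# Example: augment one two-channel training example before the DNN forward pass.
rng = np.random.default_rng(0)
example = rng.standard_normal((2, 16000))   # 1 s of two-channel audio at 16 kHz
augmented = augment_two_channel(example, rng)
```

In this sketch the random codec assignment happens independently per channel, so the DoA estimator sees all combinations of coded and uncoded inputs during training, which is the property the abstract credits for the generalized model's robustness.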