End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation

2022 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2022-10-19. DOI: 10.1109/SLT54892.2023.10023199

Yoshiki Masuyama, Xuankai Chang, Samuele Cornell, Shinji Watanabe, Nobutaka Ono

Citations: 9

Abstract

Self-supervised learning representation (SSLR) has demonstrated its significant effectiveness in automatic speech recognition (ASR), mainly with clean speech. Recent work pointed out the strength of integrating SSLR with single-channel speech enhancement for ASR in noisy environments. This paper further advances this integration by dealing with multi-channel input. We propose a novel end-to-end architecture by integrating dereverberation, beamforming, SSLR, and ASR within a single neural network. Our system achieves the best performance reported in the literature on the CHiME-4 6-channel track with a word error rate (WER) of 1.77%. While the WavLM-based strong SSLR demonstrates promising results by itself, the end-to-end integration with the weighted power minimization distortionless response beamformer, which simultaneously performs dereverberation and denoising, improves WER significantly. Its effectiveness is also validated on the REVERB dataset.
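For readers unfamiliar with the metric quoted in the abstract, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition in plain Python. It is not the scoring tool used by the authors, the function name `word_error_rate` is hypothetical, and real evaluations typically apply text normalization that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 1.77% corresponds to fewer than two word errors per hundred reference words. Because insertions count as errors, the value can exceed 100% when the hypothesis is much longer than the reference.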



