{"title":"基于自监督学习表征的语音分离与识别端到端集成","authors":"Yoshiki Masuyama , Xuankai Chang , Wangyou Zhang , Samuele Cornell , Zhong-Qiu Wang , Nobutaka Ono , Yanmin Qian , Shinji Watanabe","doi":"10.1016/j.csl.2025.101813","DOIUrl":null,"url":null,"abstract":"<div><div>Multi-speaker automatic speech recognition (ASR) has gained growing attention in a wide range of applications, including conversation analysis and human–computer interaction. Speech separation and enhancement (SSE) and single-speaker ASR have witnessed remarkable performance improvements with the rapid advances in deep learning. Complex spectral mapping predicts the short-time Fourier transform (STFT) coefficients of each speaker and has achieved promising results in several SSE benchmarks. Meanwhile, self-supervised learning representation (SSLR) has demonstrated its significant advantage in single-speaker ASR. In this work, we push forward the performance of multi-speaker ASR under noisy reverberant conditions by integrating powerful SSE, SSL, and ASR models in an end-to-end manner. We systematically investigate both monaural and multi-channel SSE methods and various feature representations. Our experiments demonstrate the advantages of recently proposed complex spectral mapping and SSLRs in multi-speaker ASR. The experimental results also confirm that end-to-end fine-tuning with an ASR criterion is important to achieve state-of-the-art word error rates (WERs) even with powerful pre-trained models. Moreover, we show the performance trade-off between SSE and ASE and mitigate it with a multi-task learning framework with both SSE and ASR criteria.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101813"},"PeriodicalIF":3.1000,"publicationDate":"2025-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An end-to-end integration of speech separation and recognition with self-supervised learning representation\",\"authors\":\"Yoshiki Masuyama , Xuankai Chang , Wangyou Zhang , Samuele Cornell , Zhong-Qiu Wang , Nobutaka Ono , Yanmin Qian , Shinji Watanabe\",\"doi\":\"10.1016/j.csl.2025.101813\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multi-speaker automatic speech recognition (ASR) has gained growing attention in a wide range of applications, including conversation analysis and human–computer interaction. Speech separation and enhancement (SSE) and single-speaker ASR have witnessed remarkable performance improvements with the rapid advances in deep learning. Complex spectral mapping predicts the short-time Fourier transform (STFT) coefficients of each speaker and has achieved promising results in several SSE benchmarks. Meanwhile, self-supervised learning representation (SSLR) has demonstrated its significant advantage in single-speaker ASR. In this work, we push forward the performance of multi-speaker ASR under noisy reverberant conditions by integrating powerful SSE, SSL, and ASR models in an end-to-end manner. We systematically investigate both monaural and multi-channel SSE methods and various feature representations. Our experiments demonstrate the advantages of recently proposed complex spectral mapping and SSLRs in multi-speaker ASR. The experimental results also confirm that end-to-end fine-tuning with an ASR criterion is important to achieve state-of-the-art word error rates (WERs) even with powerful pre-trained models. 
Moreover, we show the performance trade-off between SSE and ASE and mitigate it with a multi-task learning framework with both SSE and ASR criteria.</div></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":\"95 \",\"pages\":\"Article 101813\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2025-05-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0885230825000385\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230825000385","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
An end-to-end integration of speech separation and recognition with self-supervised learning representation
Abstract: Multi-speaker automatic speech recognition (ASR) has gained growing attention in a wide range of applications, including conversation analysis and human–computer interaction. Speech separation and enhancement (SSE) and single-speaker ASR have seen remarkable performance improvements with the rapid advances in deep learning. Complex spectral mapping predicts the short-time Fourier transform (STFT) coefficients of each speaker and has achieved promising results on several SSE benchmarks. Meanwhile, self-supervised learning representations (SSLRs) have demonstrated a significant advantage in single-speaker ASR. In this work, we advance the performance of multi-speaker ASR under noisy reverberant conditions by integrating powerful SSE, SSLR, and ASR models in an end-to-end manner. We systematically investigate both monaural and multi-channel SSE methods and various feature representations. Our experiments demonstrate the advantages of recently proposed complex spectral mapping and SSLRs in multi-speaker ASR. The results also confirm that end-to-end fine-tuning with an ASR criterion is important for achieving state-of-the-art word error rates (WERs), even with powerful pre-trained models. Moreover, we show a performance trade-off between SSE and ASR and mitigate it with a multi-task learning framework that uses both SSE and ASR criteria.
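To make the integration concrete, the following PyTorch sketch shows how such a pipeline could be wired together and trained with both criteria. It is a minimal illustration under stated assumptions: the SSE, SSLR, and ASR modules here are tiny stand-ins for the large pre-trained models used in the paper, the class and function names, dimensions, and loss weight are hypothetical, and permutation-invariant training (standard in SSE) is omitted for brevity.

```python
# Minimal, hypothetical sketch of the pipeline described above: an SSE
# front-end that predicts per-speaker STFT coefficients (complex spectral
# mapping), an SSLR feature extractor, and an ASR back-end, trained
# jointly with a weighted multi-task loss. All module names, dimensions,
# and the loss weight are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSpeakerPipeline(nn.Module):
    def __init__(self, n_fft=512, n_speakers=2, sslr_dim=768, vocab=500):
        super().__init__()
        self.n_speakers = n_speakers
        self.freq = n_fft // 2 + 1
        # SSE: maps the mixture STFT (real+imag stacked) to real+imag
        # STFT coefficients for each speaker (complex spectral mapping).
        self.sse = nn.Sequential(
            nn.Linear(2 * self.freq, 512), nn.ReLU(),
            nn.Linear(512, 2 * self.freq * n_speakers),
        )
        # Stand-in for a large pre-trained SSLR encoder (e.g., a
        # WavLM/HuBERT-style model); a single projection keeps the
        # sketch self-contained.
        self.sslr = nn.Linear(2 * self.freq, sslr_dim)
        # Stand-in ASR back-end producing per-frame token logits.
        self.asr = nn.Sequential(
            nn.Linear(sslr_dim, 256), nn.ReLU(), nn.Linear(256, vocab),
        )

    def forward(self, mix_stft):
        # mix_stft: (batch, time, 2*freq)
        sep = self.sse(mix_stft)
        sep = sep.view(*sep.shape[:2], self.n_speakers, 2 * self.freq)
        feats = self.sslr(sep)      # per-speaker SSLR features
        logits = self.asr(feats)    # per-speaker token logits
        return sep, logits

def multitask_loss(sep, ref_stft, logits, targets, weight=0.3):
    # L = weight * L_SSE + (1 - weight) * L_ASR. The abstract only states
    # that both criteria are combined; this weighting is an assumption.
    # Frame-level cross-entropy stands in for a real ASR loss (e.g.,
    # CTC/attention), and permutation-invariant training is omitted.
    l_sse = F.mse_loss(sep, ref_stft)
    l_asr = F.cross_entropy(logits.flatten(0, 2), targets.flatten())
    return weight * l_sse + (1 - weight) * l_asr
```

A training step would compute `sep, logits = model(mix_stft)` and backpropagate `multitask_loss(sep, ref_stft, logits, targets)`; setting `weight=0` reduces this to the ASR-criterion-only fine-tuning that the abstract reports as important for state-of-the-art WERs, while a nonzero weight keeps the SSE criterion in play to mitigate the SSE–ASR trade-off.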
Journal introduction:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.