Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2024-07-06 DOI:10.1016/j.csl.2024.101685

Simon Leglaive , Matthieu Fraticelli , Hend ElGhazaly , Léonie Borne , Mostafa Sadeghi , Scott Wisdom , Manuel Pariente , John R. Hershey , Daniel Pressnitzer , Jon P. Barker

{"title":"Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge","authors":"Simon Leglaive , Matthieu Fraticelli , Hend ElGhazaly , Léonie Borne , Mostafa Sadeghi , Scott Wisdom , Manuel Pariente , John R. Hershey , Daniel Pressnitzer , Jon P. Barker","doi":"10.1016/j.csl.2024.101685","DOIUrl":null,"url":null,"abstract":"<div><p>Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals. However, the synthetic training conditions may not accurately reflect real-world conditions encountered during testing. This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain. To tackle this issue, the UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models. Specifically, this test domain corresponds to the CHiME-5 dataset, characterized by real multi-speaker and conversational speech recordings made in noisy and reverberant domestic environments, for which ground-truth clean speech signals are not available. In this paper, we present the objective and subjective evaluations of the systems that were submitted to the CHiME-7 UDASE task, and we provide an analysis of the results. This analysis reveals a limited correlation between subjective ratings and several supervised nonintrusive performance metrics recently proposed for speech enhancement. Conversely, the results suggest that more traditional intrusive objective metrics can be used for in-domain performance evaluation using the reverberant LibriCHiME-5 dataset developed for the challenge. The subjective evaluation indicates that all systems successfully reduced the background noise, but always at the expense of increased distortion. Out of the four speech enhancement methods evaluated subjectively, only one demonstrated an improvement in overall quality compared to the unprocessed noisy speech, highlighting the difficulty of the task. The tools and audio material created for the CHiME-7 UDASE task are shared with the community.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101685"},"PeriodicalIF":3.1000,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000688/pdfft?md5=8f9da64ecc09fa13d3d77b048c8fa3ae&pid=1-s2.0-S0885230824000688-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230824000688","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals. However, the synthetic training conditions may not accurately reflect real-world conditions encountered during testing. This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain. To tackle this issue, the UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models. Specifically, this test domain corresponds to the CHiME-5 dataset, characterized by real multi-speaker and conversational speech recordings made in noisy and reverberant domestic environments, for which ground-truth clean speech signals are not available. In this paper, we present the objective and subjective evaluations of the systems that were submitted to the CHiME-7 UDASE task, and we provide an analysis of the results. This analysis reveals a limited correlation between subjective ratings and several supervised nonintrusive performance metrics recently proposed for speech enhancement. Conversely, the results suggest that more traditional intrusive objective metrics can be used for in-domain performance evaluation using the reverberant LibriCHiME-5 dataset developed for the challenge. The subjective evaluation indicates that all systems successfully reduced the background noise, but always at the expense of increased distortion. Out of the four speech enhancement methods evaluated subjectively, only one demonstrated an improvement in overall quality compared to the unprocessed noisy speech, highlighting the difficulty of the task. The tools and audio material created for the CHiME-7 UDASE task are shared with the community.

查看原文本刊更多论文

第 7 届 CHiME 挑战赛 UDASE 任务中对语音增强方法的客观和主观评估

用于语音增强的监督模型是利用人工生成的干净语音和噪声信号混合物进行训练的。然而，合成训练条件可能无法准确反映测试过程中遇到的实际情况。当测试域与合成训练域有显著差异时，这种差异会导致性能低下。为了解决这个问题，第七届 CHiME 挑战赛的 UDASE 任务旨在利用来自测试域的真实世界噪声语音记录，对语音增强模型进行无监督域适应。具体来说，该测试域与 CHiME-5 数据集相对应，其特点是在嘈杂和混响的家庭环境中录制的真实多讲话者会话语音记录，而这些记录无法获得地面真实的干净语音信号。在本文中，我们介绍了提交给 CHiME-7 UDASE 任务的系统的客观和主观评价，并对结果进行了分析。分析表明，主观评价与最近提出的几种用于语音增强的有监督非侵入式性能指标之间的相关性有限。相反，结果表明，使用为挑战赛开发的混响LibriCHiME-5数据集，更传统的侵入式客观指标可用于域内性能评估。主观评估结果表明，所有系统都成功降低了背景噪声，但总是以增加失真为代价。在主观评估的四种语音增强方法中，只有一种与未经处理的噪声语音相比，整体质量有所提高，这凸显了这项任务的难度。为 CHiME-7 UDASE 任务创建的工具和音频资料已与社区共享。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Computer Speech and Language 工程技术-计算机：人工智能

CiteScore

11.30

自引率

4.70%

发文量

审稿时长

22.9 weeks

期刊介绍： Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.