Carlos Mena, A.L. Padilla-Ortiz, Felipe Orduña-Bustamante
Title: Automatic speech recognition in the presence of babble noise and reverberation compared to human speech intelligibility in Spanish
Journal: Computer Speech and Language, Vol. 95, Article 101856
DOI: 10.1016/j.csl.2025.101856
Published: 2025-06-24 (Journal Article)
Impact Factor: 3.4; JCR: Q2 (Computer Science, Artificial Intelligence)
URL: https://www.sciencedirect.com/science/article/pii/S0885230825000816
Citations: 0
Abstract
The performance of three representative automatic speech recognition (ASR) systems, NeMo, Wav2Vec, and Whisper, was evaluated for the Spanish language as spoken in the central region of Mexico, in the presence of speech babble noise as a function of signal-to-noise ratio (SNR), and also separately under different reverberant conditions. NeMo and Wav2Vec were pretrained, or specially fine-tuned, for the recognition of Mexican Spanish, as required by the language architectures of these ASR systems, while Whisper was used without such fine-tuning. Speech intelligibility tests with human participants were also carried out on the same speech material and under the same acoustic conditions of noise and reverberation. Character error rate and word error rate metrics were mapped into speech intelligibility scores, speech reception thresholds, and intelligibility slopes, the latter being performance metrics more commonly used in the evaluation of human speech intelligibility. ASR results show profiles of performance vs. SNR akin to those found for human listeners. Comparison with speech intelligibility results from human listeners shows speech reception thresholds (signal-to-noise levels in dB corresponding to 50% intelligibility in the presence of acoustic noise) that are higher, indicating lower performance relative to humans: by 1.8 dB for Whisper, 3.0 dB for Wav2Vec, and 7.0 dB for NeMo. Intelligibility slopes (indicating the rate of performance recovery with increasing SNR) were higher for Whisper (13.6%/dB) and Wav2Vec (12.0%/dB) and lower for NeMo (5.0%/dB), relative to an intermediate value for humans (9.3%/dB). Performance with reverberated speech indicates reverberation time thresholds (for 50% intelligibility) of 3.1 s for Whisper, 2.6 s for humans, 1.4 s for Wav2Vec, and 1.0 s for NeMo.
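The abstract does not spell out how word error rate is mapped into an intelligibility score; a minimal sketch of one plausible convention, taking intelligibility as 100 × (1 − WER) with WER computed by word-level Levenshtein distance (the Spanish sentence pair below is invented purely for illustration):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(r)

def intelligibility(ref: str, hyp: str) -> float:
    # One plausible mapping: intelligibility in %, clipped at 0 since
    # WER can exceed 1 on very noisy transcripts.
    return max(0.0, 100.0 * (1.0 - wer(ref, hyp)))

# Invented example: one substitution out of six words.
score = intelligibility("el gato duerme en la silla",
                        "el gato duerme en una silla")
print(f"{score:.1f}%")  # 83.3%
```

An analogous character-level mapping would apply the same edit distance to character tokens for the CER-based scores mentioned above.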
Whisper outperforms Wav2Vec and NeMo in all aspects, and also outperforms humans in terms of intelligibility slope and reverberation threshold, though not in speech reception threshold in noise. These results provide performance metrics for the ASR systems included in this study in the context of human speech intelligibility. Moreover, in view of their good performance, Whisper and Wav2Vec lend themselves to predicting human speech intelligibility in different scenarios by conducting equivalent evaluations through automatic speech recognition.
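The speech reception threshold and intelligibility slope quoted above are defined at the 50% intelligibility point of the performance-vs-SNR curve. A minimal sketch of how they could be extracted, using simple linear interpolation between measured points rather than a full psychometric-function fit, and entirely invented score values (not data from the paper):

```python
def srt_and_slope(snrs, scores, target=50.0):
    """Linearly interpolate the SNR at `target`% intelligibility (the SRT)
    and return the local slope (%/dB) of the segment that crosses it."""
    points = list(zip(snrs, scores))
    for (s0, i0), (s1, i1) in zip(points, points[1:]):
        if i0 <= target <= i1:
            slope = (i1 - i0) / (s1 - s0)     # %/dB on this segment
            srt = s0 + (target - i0) / slope  # dB where the curve hits 50%
            return srt, slope
    raise ValueError("target intelligibility not bracketed by the data")

# Hypothetical intelligibility scores, e.g. 100 * (1 - WER), at a few SNRs.
snrs   = [-9, -6, -3, 0, 3, 6]
scores = [ 8, 20, 45, 72, 90, 97]

srt, slope = srt_and_slope(snrs, scores)
print(f"SRT = {srt:.2f} dB, slope = {slope:.1f} %/dB")  # SRT ≈ -2.44 dB, 9.0 %/dB
```

The same construction applied to intelligibility vs. reverberation time (with the curve falling rather than rising) would yield the reverberation time thresholds reported above.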
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.