Carlos Mena, A.L. Padilla-Ortiz, Felipe Orduña-Bustamante
Title: Automatic speech recognition in the presence of babble noise and reverberation compared to human speech intelligibility in Spanish
Journal: Computer Speech and Language, Vol. 95, Article 101856
DOI: 10.1016/j.csl.2025.101856
Published: 2025-06-24 (Journal Article)
Impact Factor: 3.4; JCR: Q2 (Computer Science, Artificial Intelligence)
URL: https://www.sciencedirect.com/science/article/pii/S0885230825000816
Citations: 0
Abstract
The performance of three representative automatic speech recognition (ASR) systems, NeMo, Wav2Vec, and Whisper, was evaluated for the Spanish language as spoken in the central region of Mexico, in the presence of speech babble noise as a function of signal-to-noise ratio (SNR), and also separately under different reverberant conditions. NeMo and Wav2Vec were pretrained, or specially fine-tuned, for the recognition of Mexican Spanish, as required by the language architectures of these ASR systems, while Whisper was used without such fine-tuning. Speech intelligibility tests with human participants were also carried out on the same speech material and under the same acoustic conditions of noise and reverberation. Character error rate and word error rate metrics were mapped into speech intelligibility scores, speech reception thresholds, and intelligibility slopes, the latter being performance metrics more commonly used in the evaluation of human speech intelligibility. ASR results show profiles of performance vs. SNR akin to those found for human listeners. Comparison with speech intelligibility results from human listeners shows speech reception thresholds (signal-to-noise levels in dB corresponding to 50% intelligibility in the presence of acoustic noise) that are higher, indicating lower performance relative to humans: by 1.8 dB for Whisper, 3.0 dB for Wav2Vec, and 7.0 dB for NeMo. Intelligibility slopes (indicating the rate of performance recovery with increasing SNR) were higher for Whisper (13.6%/dB) and Wav2Vec (12.0%/dB) and lower for NeMo (5.0%/dB), relative to an intermediate value for humans (9.3%/dB). Performance with reverberated speech indicates reverberation time thresholds (for 50% intelligibility) of 3.1 s for Whisper, 2.6 s for humans, 1.4 s for Wav2Vec, and 1.0 s for NeMo.
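The abstract does not spell out how word error rate is mapped into an intelligibility score; a minimal sketch of one plausible convention, taking intelligibility as 100 × (1 − WER) with WER computed by word-level Levenshtein distance (the Spanish sentence pair below is invented purely for illustration):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(r)

def intelligibility(ref: str, hyp: str) -> float:
    # One plausible mapping: intelligibility in %, clipped at 0 since
    # WER can exceed 1 on very noisy transcripts.
    return max(0.0, 100.0 * (1.0 - wer(ref, hyp)))

# Invented example: one substitution out of six words.
score = intelligibility("el gato duerme en la silla",
                        "el gato duerme en una silla")
print(f"{score:.1f}%")  # 83.3%
```

An analogous character-level mapping would apply the same edit distance to character tokens for the CER-based scores mentioned above.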
Whisper outperforms Wav2Vec and NeMo in all aspects, and also outperforms humans in terms of intelligibility slope and reverberation threshold, though not in speech reception threshold in noise. These results provide performance metrics for the ASR systems included in this study in the context of human speech intelligibility. Moreover, in view of their good performance, Whisper and Wav2Vec lend themselves to predicting human speech intelligibility in different scenarios by conducting equivalent evaluations through automatic speech recognition.
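The speech reception threshold and intelligibility slope quoted above are defined at the 50% intelligibility point of the performance-vs-SNR curve. A minimal sketch of how they could be extracted, using simple linear interpolation between measured points rather than a full psychometric-function fit, and entirely invented score values (not data from the paper):

```python
def srt_and_slope(snrs, scores, target=50.0):
    """Linearly interpolate the SNR at `target`% intelligibility (the SRT)
    and return the local slope (%/dB) of the segment that crosses it."""
    points = list(zip(snrs, scores))
    for (s0, i0), (s1, i1) in zip(points, points[1:]):
        if i0 <= target <= i1:
            slope = (i1 - i0) / (s1 - s0)     # %/dB on this segment
            srt = s0 + (target - i0) / slope  # dB where the curve hits 50%
            return srt, slope
    raise ValueError("target intelligibility not bracketed by the data")

# Hypothetical intelligibility scores, e.g. 100 * (1 - WER), at a few SNRs.
snrs   = [-9, -6, -3, 0, 3, 6]
scores = [ 8, 20, 45, 72, 90, 97]

srt, slope = srt_and_slope(snrs, scores)
print(f"SRT = {srt:.2f} dB, slope = {slope:.1f} %/dB")  # SRT ≈ -2.44 dB, 9.0 %/dB
```

The same construction applied to intelligibility vs. reverberation time (with the curve falling rather than rising) would yield the reverberation time thresholds reported above.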
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.