{"title":"Performance assessment of voice conversion models using speech production-based parameters","authors":"Ashwini Dasare, K.T. Deepak","doi":"10.1016/j.csl.2025.101853","DOIUrl":"10.1016/j.csl.2025.101853","url":null,"abstract":"<div><div>Voice Conversion (VC) transforms a source voice to sound like a target voice. However, the field requires more standardized objective metrics to evaluate its performance thoroughly. Traditional evaluation methods, such as Mel-Cepstral Distortion (MCD), F0-Root Mean Squared Error (F0RMSE), and Modulated-Spectral Distance (MSD), primarily focus on perceptual features and often overlook speech production attributes. This can result in a mismatch between perceived voice similarity and the physiological aspects of the voice, leading to a reliance on subjective methods like the Mean Opinion Score (MOS). While MOS provides valuable insights, it is resource-intensive and inherently subjective, limiting its practicality for widespread use. This research proposes an objective framework for evaluating voice quality in VC tasks by focusing on key speech production parameters, including jitter, shimmer, harmonics-to-noise ratio, and vocal tract length. Our findings suggest that these parameters, which encapsulate the distinct characteristics of a speaker’s voice, provide a more precise basis for assessing perceptual similarity between converted and target voices. Compared to traditional objective metrics like MCD, MSD, F0RMSE, and also non-intrusive measures like MOSNET, UTMOS, our proposed method consistently shows a correlation with MOS, suggesting that it better aligns with subjective evaluations of voice quality. This presents a more reliable and practical alternative to conventional methods that primarily emphasize perceptual features. This study evaluates how well different VC models, such as StarGANv2-VC, Retrival-based VC, Suno-Bark, and Diff-VC replicate speech production parameters across various languages and accents, including English, Kannada, Hindi, and the low-resource Soliga language. The results provide insights into improving the evaluation of voice conversion technologies by focusing on speech production attributes, helping to bridge the gap between perceptual similarity and physiological accuracy. The proposed work lays the groundwork for developing standardized, objective evaluation methods for VC models based on speech production characteristics.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101853"},"PeriodicalIF":3.1,"publicationDate":"2025-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Mono- and cross-lingual evaluation of representation language models on less-resourced languages
Authors: Matej Ulčar, Aleš Žagar, Carlos S. Armendariz, Andraž Repar, Senja Pollak, Matthew Purver, Marko Robnik-Šikonja
DOI: 10.1016/j.csl.2025.101852
Journal: Computer Speech and Language, vol. 95, Article 101852, published 2025-06-27
Abstract: The current dominance of large language models in natural language processing is based on their contextual awareness. For text classification, text representation models, such as ELMo, BERT, and BERT derivatives, are typically fine-tuned for a specific problem. Most existing work focuses on English; in contrast, we present a large-scale multilingual empirical comparison of several monolingual and multilingual ELMo and BERT models using 14 classification tasks in nine languages. The results show that the choice of the best model largely depends on the task and language used, especially in a cross-lingual setting. In monolingual settings, monolingual BERT models tend to perform best among BERT models. Among ELMo models, those trained on large corpora dominate. Cross-lingual knowledge transfer is feasible for most tasks even in a zero-shot setting, without losing much performance.
Title: Automatic speech recognition in the presence of babble noise and reverberation compared to human speech intelligibility in Spanish
Authors: Carlos Mena, A.L. Padilla-Ortiz, Felipe Orduña-Bustamante
DOI: 10.1016/j.csl.2025.101856
Journal: Computer Speech and Language, vol. 95, Article 101856, published 2025-06-24
Abstract: The performance of three representative automatic speech recognition (ASR) systems, NeMo, Wav2Vec, and Whisper, was evaluated for the Spanish language as spoken in the central region of Mexico, in the presence of speech babble noise as a function of signal-to-noise ratio (SNR), and also separately under different reverberant conditions. NeMo and Wav2Vec were pretrained or specially fine-tuned for the recognition of Mexican Spanish, as required by the language architectures of these ASR systems, while Whisper was used without such fine-tuning. Speech intelligibility tests with human participants were also carried out on the same speech material and under the same acoustic conditions of noise and reverberation. Character error rate and word error rate metrics were mapped into speech intelligibility scores, speech reception thresholds, and intelligibility slopes, the latter being performance metrics more commonly used in the evaluation of human speech intelligibility. ASR results show profiles of performance vs. SNR akin to those found for human listeners. Compared with speech intelligibility results from human listeners, speech reception thresholds (signal-to-noise levels in dB corresponding to 50% intelligibility in the presence of acoustic noise) are higher, indicating lower performance relative to humans, by 1.8 dB for Whisper, 3.0 dB for Wav2Vec, and 7.0 dB for NeMo. Intelligibility slopes (indicating the rate of performance recovery with increasing SNR) were higher for Whisper (13.6%/dB) and Wav2Vec (12.0%/dB) and lower for NeMo (5.0%/dB), relative to an intermediate value for humans (9.3%/dB). Performance with reverberated speech indicates reverberation time thresholds (for 50% intelligibility) of 3.1 s for Whisper, 2.6 s for humans, 1.4 s for Wav2Vec, and 1.0 s for NeMo. Whisper outperforms Wav2Vec and NeMo in all aspects, and also outperforms humans in terms of intelligibility slope and reverberation threshold, except for speech reception threshold in noise. These results provide performance metrics for the ASR systems included in this study in the context of human speech intelligibility. In view of their good performance, Whisper and Wav2Vec also lend themselves to predicting human speech intelligibility in different scenarios by conducting equivalent evaluations through automatic speech recognition.
Title: Privacy-preserving feature extractor using adversarial pruning for TBI assessment from speech
Authors: Apiwat Ditthapron, Emmanuel O. Agu, Adam C. Lammert
DOI: 10.1016/j.csl.2025.101854
Journal: Computer Speech and Language, vol. 95, Article 101854, published 2025-06-23
Abstract: Speech is an effective indicator of medical conditions such as Traumatic Brain Injury (TBI), but it frequently includes private information, preventing novel passive, real-world assessments using the patient’s smartphone. Privacy research in speech processing has primarily focused on hiding the speaker’s identity, which is used in authentication systems and cannot be renewed. Our study extends privacy to include the content of speech, specifically sensitive words during conversation. Prior work has proposed extracting privacy-preserving features via adversarial training, which trains a neural network to defend against attacks on private data that an adversarial network simultaneously attempts to access. However, adversarial training suffers from training instability due to the inherent limitations of minimax optimization. Instead, our study introduces Privacy-Preserving using Adversarial Pruning (PPA-Pruning): nodes are systematically removed from a well-trained feature extractor designed for TBI detection and adversarial tasks, prioritizing those that contribute most to the recognition of personal data. PPA-Pruning was evaluated for various privacy budgets via a differential privacy setup. Notably, PPA-Pruning outperforms baseline methods, including adversarial training and Laplace noise, achieving up to an 11% improvement in TBI detection accuracy at the same privacy level.
{"title":"Sentiment classification method based on BERT-CondConv multi-moment state fusion","authors":"Wang Xiaoyang , Liu Wenfeng","doi":"10.1016/j.csl.2025.101855","DOIUrl":"10.1016/j.csl.2025.101855","url":null,"abstract":"<div><div>Sentiment classification has emerged as a significant research area in the field of natural language processing, garnering considerable attention in recent years. However, obtaining feature information of text sequences for sentiment classification, especially for texts with diverse characteristics, remains a challenging task. Traditional methods for extracting text features often treat all data in a uniform manner. To address this issue, we propose a hybrid sentiment classification model called BERT-CondConv, which integrates the strengths of BERT and conditional parameter convolution networks. By applying adaptive conditional parameter convolution on the hidden state feature information at different time steps of BERT, our model enhances feature extraction and optimization, and finally fusion features, thus improving the sentiment classification task. We compared various base model architectures and benchmarked our method against state-of-the-art techniques. The experimental results demonstrate the effectiveness of our approach.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101855"},"PeriodicalIF":3.1,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AraFastQA: a transformer model for question-answering for Arabic language using few-shot learning","authors":"Asmaa Alrayzah , Fawaz Alsolami , Mostafa Saleh","doi":"10.1016/j.csl.2025.101857","DOIUrl":"10.1016/j.csl.2025.101857","url":null,"abstract":"<div><div>In recent years, numerous studies have developed pre-trained language models (PLMs) for Arabic natural language processing (NLP) tasks, including question-answering (QA), but often overlooking the challenge of data scarcity. This study introduces the Arabic Few-Shot QA (AraFastQA) pre-trained language model to confront the challenge of limited resources in Arabic QA tasks. The primary contributions of this study involve developing an PLM based on a few-shot learning (FSL) approach to address the challenge of low-resource datasets in Arabic QA. Moreover, this study contributes to the developing of Arabic benchmark few-shot QA datasets. By using the few-shot datasets, we compare the AraFastQA PLM with the state-of-art Arabic PLMs such that AraBERT, AraELECTRA, and XLM-Roberta. We evaluated AraFastQA and state-of-art models on two Arabic benchmark datasets that are Arabic reading comprehension (ARCD) and the typologically diverse question answering (TyDiQA). The obtained experimental results show that AraFastQA outperforms other models across eight training sample sizes of the Arabic benchmark datasets. For instance, our proposed PLM achieves 73.2 of F1-score on TyDi QA with only 1024 training examples while the highest accuracy of other models (AraELECTRA) achieves 56.1. For the full training dataset of ARCD dataset, AraFastQA improves accuracy by 9 %, 3 %, and 10 % of AraBERT, AraELECTRA, and XLM-Roberta respectively.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101857"},"PeriodicalIF":3.1,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144470963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Predicting accentedness and comprehensibility through ASR scores and acoustic features
Authors: Wenwei Dong, Catia Cucchiarini, Roeland van Hout, Helmer Strik
DOI: 10.1016/j.csl.2025.101858
Journal: Computer Speech and Language, vol. 95, Article 101858, published 2025-06-18
Abstract: Accentedness and comprehensibility scales are widely used in measuring the oral proficiency of second language (L2) learners, including learners of English as a Second Language (ESL). In this paper, we focus on gaining a better understanding of the concepts of accentedness and comprehensibility by developing and applying automatic measures to ESL utterances produced by Indonesian learners. We extracted features at both the segmental and the suprasegmental level (fundamental frequency, loudness, energy, etc.) to investigate which features are actually related to expert judgments of accentedness and comprehensibility. Automatic Speech Recognition (ASR) pronunciation scores based on a traditional Kaldi Time Delay Neural Network (TDNN) model and on the end-to-end Whisper model were applied, and data-driven methods were used by combining acoustic features extracted with the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) and Praat. The experimental results showed that Whisper outperformed the Kaldi-TDNN model. The Whisper model gave the best results for predicting comprehensibility on the basis of phone distance, and the best results for predicting accentedness on the basis of grapheme distance. Combining segmental and suprasegmental features improved the results, yielding different feature rankings for comprehensibility and accentedness. In the final step of our analysis, we included differences between utterances and learners as random effects in a mixed linear regression model. Exploiting these information sources yielded a substantial improvement in predicting both comprehensibility and accentedness.
{"title":"Multi-turn response selection with Language Style and Topic Aware enhancement","authors":"Weiwei Li, Yuzhong Chen, Junjie Xu, Jiayuan Zhong, Chen Dong","doi":"10.1016/j.csl.2025.101842","DOIUrl":"10.1016/j.csl.2025.101842","url":null,"abstract":"<div><div>The multi-turn response selection is an important component in retrieval-based human–computer dialogue systems. Most recent models adopt the utilization of pre-trained language models to acquire fine-grained semantic information within diverse dialogue contexts, thereby enhancing the precision of response selection. However, effectively leveraging the language style information of speakers along with the topic information in the dialogue context to enhance the semantic understanding capability of pre-trained language models still poses a significant challenge that requires resolution. To address this challenge, we propose a BERT-based Language Style and Topic Aware (BERT-LSTA) model for multi-turn response selection. BERT-LSTA augments BERT with two distinctive modules: the Language Style Aware (LSA) module and the Question-oriented Topic Window Selection (QTWS) module. The LSA module introduces a contrastive learning method to learn the latent language style information from distinct speakers in the dialogue. The QTWS module proposes a topic window segmentation algorithm to segment the dialogue context into topic windows, which facilitates the capacity of BERT-LSTA to refine and incorporate relevant topic information for response selection. Experimental results on two public benchmark datasets demonstrate that BERT-LSTA outperforms all state-of-the-art baseline models across various metrics. Furthermore, ablation studies reveal that the LSA module significantly improves performance by capturing speaker-specific language styles, while the QTWS module enhances topic relevance by filtering irrelevant contextual information.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101842"},"PeriodicalIF":3.1,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144298502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minerva 2 for speech and language tasks","authors":"Rhiannon Mogridge, Anton Ragni","doi":"10.1016/j.csl.2025.101843","DOIUrl":"10.1016/j.csl.2025.101843","url":null,"abstract":"<div><div>Most artificial neural networks do not directly incorporate a memory of previous experiences, instead using training data to parameterise a model, and then discarding the training data prior to inference. While some recent models have included a memory, this has typically been added to an already highly parameterised model. An alternative option is to use a purely memory-based model, and then add parameters. This has been shown to work for Minerva 2, a simple, non-parametric, memory-based model which has been widely used in the field of human psychology. We revisit the use of Minerva 2 for speech and language tasks, drawing comparisons between Minerva 2 and other architectures, and showing that an iterative process that Minerva 2 uses for inference is a close relative of deep equilibrium models. We assess parameterised models based on Minerva 2, including a sequence model inspired by Minerva 2’s similarity to the transformer architecture, which shows promising results.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101843"},"PeriodicalIF":3.1,"publicationDate":"2025-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144314149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition
Authors: Alexander Polok, Dominik Klement, Martin Kocour, Jiangyu Han, Federico Landini, Bolaji Yusuf, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget
DOI: 10.1016/j.csl.2025.101841
Journal: Computer Speech and Language, vol. 95, Article 101841, published 2025-06-13
Abstract: Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for extensive speaker-specific training data. Our method introduces frame-level diarization-dependent transformations (FDDT) and query-key biasing (QKb) techniques to refine the model’s focus on target speakers while effectively handling overlapping speech. By leveraging diarization outputs as conditioning signals, DiCoW simplifies the workflow for multi-speaker ASR, improves generalization to unseen speakers, and enables more reliable transcription in real-world multi-speaker recordings. Additionally, we explore the integration of a connectionist temporal classification (CTC) head into Whisper and demonstrate its ability to improve transcription efficiency through hybrid decoding. Notably, we show that our approach is not limited to Whisper; it also provides similar benefits when applied to the Branchformer model. We validate DiCoW on real-world datasets, including AMI and NOTSOFAR-1 from the CHiME-8 challenge, as well as synthetic benchmarks such as Libri2Mix and LibriCSS, enabling direct comparisons with previous methods. Results demonstrate that DiCoW enhances the model’s target-speaker ASR capabilities while maintaining Whisper’s accuracy and robustness on single-speaker data.