{"title":"Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection","authors":"Bang Zeng, Ming Li","doi":"10.1016/j.csl.2025.101807","DOIUrl":"10.1016/j.csl.2025.101807","url":null,"abstract":"<div><div>Determining “who spoke what and when” remains challenging in real-world applications. In typical scenarios, Speaker Diarization (SD) is employed to address the problem of “who spoke when”, while Target Speaker Extraction (TSE) or Target Speaker Automatic Speech Recognition (TSASR) techniques are utilized to resolve the issue of “who spoke what”. Although some works have achieved promising results by combining SD and TSE systems, inconsistencies remain between SD and TSE regarding both output inconsistency and scenario mismatch. To address these limitations, we propose a Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection (USEF-TP) model that jointly performs TSE and Personal Voice Activity Detection (PVAD). USEF-TP leverages frame-level features obtained through a cross-attention mechanism as speaker-related features instead of using speaker embeddings as in traditional approaches. Additionally, a multi-task learning algorithm with a scenario-aware differentiated loss function is applied to ensure robust performance across various levels of speaker overlap. The experimental results show that our proposed USEF-TP model achieves superior performance in TSE and PVAD tasks on the LibriMix and SparseLibriMix datasets. The results on the CALLHOME dataset demonstrate the competitive performance of our model on real recordings.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101807"},"PeriodicalIF":3.1,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143918432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tailored design of Audio–Visual Speech Recognition models using Branchformers","authors":"David Gimeno-Gómez, Carlos D. Martínez-Hinarejos","doi":"10.1016/j.csl.2025.101811","DOIUrl":"10.1016/j.csl.2025.101811","url":null,"abstract":"<div><div>Recent advances in Audio–Visual Speech Recognition (AVSR) have led to unprecedented achievements in the field, improving the robustness of this type of system in adverse, noisy environments. In most cases, this task has been addressed through the design of models composed of two independent encoders, each dedicated to a specific modality. However, while recent works have explored unified audio–visual encoders, determining the optimal cross-modal architecture remains an ongoing challenge. Furthermore, such approaches often rely on models comprising vast amounts of parameters and high computational cost training processes. In this paper, we aim to bridge this research gap by introducing a novel audio–visual framework. Our proposed method constitutes, to the best of our knowledge, the first attempt to harness the flexibility and interpretability offered by encoder architectures, such as the Branchformer, in the design of parameter-efficient AVSR systems. To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio–visual unified encoder based on the layer-level branch scores provided by the modality-specific models. Extensive experiments on English and Spanish AVSR benchmarks covering multiple data conditions and scenarios demonstrated the effectiveness of our proposed method. Even when trained on a moderate scale of data, our models achieve competitive word error rates (WER) of approximately 2.5% for English and surpass existing approaches for Spanish, establishing a new benchmark with an average WER of around 9.1%. These results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates while significantly reducing the model complexity w.r.t. the prevalent approach in the field. Code and pre-trained models are available at <span><span>https://github.com/david-gimeno/tailored-avsr</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101811"},"PeriodicalIF":3.1,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143918433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adapting general disentanglement-based speaker anonymization for enhanced emotion preservation","authors":"Xiaoxiao Miao , Yuxiang Zhang , Xin Wang , Natalia Tomashenko , Donny Cheng Lock Soh , Ian Mcloughlin","doi":"10.1016/j.csl.2025.101810","DOIUrl":"10.1016/j.csl.2025.101810","url":null,"abstract":"<div><div>A general disentanglement-based speaker anonymization system typically separates speech into content, speaker, and prosody features using individual encoders. This paper explores how to adapt such a system when a new speech attribute, for example, emotion, needs to be preserved to a greater extent. Two strategies for this are examined. First, we show that integrating emotion embeddings from a pre-trained emotion encoder can help preserve emotional cues, even though this approach slightly compromises privacy protection. Alternatively, we propose an emotion compensation strategy as a post-processing step applied to anonymized speaker embeddings. This conceals the original speaker’s identity and reintroduces the emotional traits lost during speaker embedding anonymization. Specifically, we model the emotion attribute using support vector machines to learn separate boundaries for each emotion. During inference, the original speaker embedding is processed in two ways: one, by an emotion indicator to predict emotion and select the emotion-matched SVM accurately; and two, by a speaker anonymizer to conceal speaker characteristics. The anonymized speaker embedding is then modified along the corresponding SVM boundary towards an enhanced emotional direction to save the emotional cues. The proposed strategies are also expected to be useful for adapting a general disentanglement-based speaker anonymization system to preserve other target paralinguistic attributes, with potential for a range of downstream tasks.<span><span><sup>2</sup></span></span></div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101810"},"PeriodicalIF":3.1,"publicationDate":"2025-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143906655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantifying prediction uncertainties in automatic speaker verification systems","authors":"Miao Jing , Vidhyasaharan Sethu , Beena Ahmed , Kong Aik Lee","doi":"10.1016/j.csl.2025.101806","DOIUrl":"10.1016/j.csl.2025.101806","url":null,"abstract":"<div><div>For modern automatic speaker verification (ASV) systems, explicitly quantifying the confidence for each prediction strengthens the system’s reliability by indicating in which case the system is with trust. However, current paradigms do not take this into consideration. We thus propose to express confidence in the prediction by quantifying the uncertainty in ASV predictions. This is achieved by developing a novel Bayesian framework to obtain a score distribution for each input. The mean of the distribution is used to derive the decision while the spread of the distribution represents the uncertainty arising from the plausible choices of the model parameters. To capture the plausible choices, we sample the probabilistic linear discriminant analysis (PLDA) back-end model posterior through Hamiltonian Monte-Carlo (HMC) and approximate the embedding model posterior through stochastic Langevin dynamics (SGLD) and Bayes-by-backprop. Given the resulting score distribution, a further quantification and decomposition of the prediction uncertainty are achieved by calculating the score variance, entropy, and mutual information. The quantified uncertainties include the aleatoric uncertainty and epistemic uncertainty (model uncertainty). We evaluate them by observing how they change while varying the amount of training speech, the duration, and the noise level of testing speech. The experiments indicate that the behaviour of those quantified uncertainties reflects the changes we made to the training and testing data, demonstrating the validity of the proposed method as a measure of uncertainty.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101806"},"PeriodicalIF":3.1,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143903647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SpeechColab leaderboard: An open-source platform for automatic speech recognition evaluation","authors":"Jiayu Du , Jinpeng Li , Guoguo Chen , Wei-Qiang Zhang","doi":"10.1016/j.csl.2025.101805","DOIUrl":"10.1016/j.csl.2025.101805","url":null,"abstract":"<div><div>In the wake of the surging tide of deep learning over the past decade, Automatic Speech Recognition (ASR) has garnered substantial attention, leading to the emergence of numerous publicly accessible ASR systems that are actively being integrated into our daily lives. Nonetheless, impartial and replicable evaluations of these ASR systems encounter challenges due to various subtleties. In this paper we introduce the SpeechColab Leaderboard, a general-purpose, open-source platform designed for ASR evaluation. With this platform: (i) We report a comprehensive benchmark, unveiling the current state-of-the-art panorama for ASR systems, covering both open-source models and industrial commercial services. (ii) We quantize how distinct nuances in the scoring pipeline influence the final benchmark outcomes, including capitalization, punctuation, interjection, contraction, synonym usage, compound words, etc. These issues have gained prominence in the context of the transition towards End-to-End ASR systems. (iii) We propose and discuss a modification to the conventional Token-Error-Rate (TER) metric, called modified-TER (mTER), inspired from Kolmogorov Complexity and Normalized Information Distance (NID). The proposed metric becomes normalized and symmetrical (with regard to reference and hypothesis). A large-scale empirical study is then presented comparing TER and mTER. The SpeechColab Leaderboard is accessible at <span><span>https://github.com/SpeechColab/Leaderboard</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101805"},"PeriodicalIF":3.1,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143881968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging distance information for generalized spoofing speech detection","authors":"Jingze Lu , Yuxiang Zhang , Zhuo Li , Zengqiang Shang , Wenchao Wang , Pengyuan Zhang","doi":"10.1016/j.csl.2025.101804","DOIUrl":"10.1016/j.csl.2025.101804","url":null,"abstract":"<div><div>Spoofing speech detection (SSD) systems are confronted with insufficient generalization ability for in-the-wild data, including unseen attacks and bonafide speech from unseen distributions, which hampers their applicability in real-world scenarios. Such performance degradation could be attributed to the inherent flaw of deep neural network (DNN)-based models, that is, overlearning the training data. Inter-instance distance, which is underutilized in conventional DNN-based classifiers, proves beneficial in handling unseen samples. Our experiments indicate that the distances between bonafide speech are closer than spoofing one in certain feature spaces. Therefore, this paper proposes a distance-based method to enhance anti-spoofing models’ generalization ability. By incorporating distance features as a prefix, the proposed method achieves lightweight parameter updates while effectively detecting unseen attacks and bonafide utterances from unseen distributions. On the logical access of ASVspoof 2019 and ASVspoof 2021, the proposed method achieves 0.53% and 4.73% equal error rates (EERs). Moreover, it achieves 1.86% and 7.30% EERs on the ASVspoof 2021 Deepfake and IntheWild datasets, respectively, demonstrating its superior generalization ability. The proposed method outperforms other state-of-the-art (SOTA) methods on multiple datasets.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101804"},"PeriodicalIF":3.1,"publicationDate":"2025-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143918431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vishing: Detecting social engineering in spoken communication — A first survey & urgent roadmap to address an emerging societal challenge","authors":"Andreas Triantafyllopoulos , Anika A. Spiesberger , Iosif Tsangko , Xin Jing , Verena Distler , Felix Dietz , Florian Alt , Björn W. Schuller","doi":"10.1016/j.csl.2025.101802","DOIUrl":"10.1016/j.csl.2025.101802","url":null,"abstract":"<div><div>Vishing – the use of voice calls for phishing – is a form of Social Engineering (SE) attacks. The latter have become a pervasive challenge in modern societies, with over 300,000 yearly victims in the US alone. An increasing number of those attacks is conducted via voice communication, be it through machine-generated ‘robocalls’ or human actors. The goals of ‘social engineers’ can be manifold, from outright fraud to more subtle forms of persuasion. Accordingly, social engineers adopt multi-faceted strategies for voice-based attacks, utilising a variety of ‘tricks’ to exert influence and achieve their goals. Importantly, while organisations have set in place a series of guardrails against other types of SE attacks, voice calls still remain ‘open ground’ for potential bad actors. In the present contribution, we provide an overview of the existing speech technology subfields that need to coalesce into a protective net against one of the major challenges to societies worldwide. Given the dearth of speech science and technology works targeting this issue, we have opted for a narrative review that bridges the gap between the existing psychological literature on the topic and research that has been pursued in parallel by the speech community on some of the constituent constructs. Our review reveals that very little literature exists on addressing this very important topic from a speech technology perspective, an omission further exacerbated by the lack of available data. Thus, our main goal is to highlight this gap and sketch out a roadmap to mitigate it, beginning with the psychological underpinnings of vishing, which primarily include deception and persuasion strategies, continuing with the speech-based approaches that can be used to detect those, as well as the generation and detection of AI-based vishing attempts, and close with a discussion of ethical and legal considerations.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101802"},"PeriodicalIF":3.1,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143839274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Influence of the surprisal power adjustment on spoken word duration in emotional speech in Serbian","authors":"Jelena Lazić, Sanja Vujnović","doi":"10.1016/j.csl.2025.101803","DOIUrl":"10.1016/j.csl.2025.101803","url":null,"abstract":"<div><div>Emotional speech analysis has been a topic of interest across multiple disciplines. However, it remains a challenging task due to its complexity and multimodality. Computer systems still struggle with robustness when dealing with emotional speech. Despite being a difficult area of research, the wide range of potential applications, especially nowadays in the era of intelligent agents and humanoid systems, has led to increased interest in this field. With the development of machine learning models, a variety of novel techniques have emerged, including pre-trained language models. In this work, we used these models to research emotional speech analysis from an information-theory perspective. Specifically, we focused on analyzing language processing difficulty, measured by word-level spoken time duration, and its relation to information distribution over speech, measured by word-level surprisal values. We analyzed a dataset of audio recordings in the low-resourced Serbian language, recorded under five different speakers’ emotional states. Seven state-of-the-art machine learning language models were employed to estimate surprisal values, which were then used as predictive parameters for word-level spoken time duration. Our results supported related studies in the English language and indicated that machine learning-estimated surprisal values may be good predictors of speech parameters in Serbian. Furthermore, modulating the power of surprisal values led to different outcomes for various speakers’ emotional states. This suggests potential differences in the role of surprisal values in speech production under different emotional conditions.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101803"},"PeriodicalIF":3.1,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143850116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ViTASA: New benchmark and methods for Vietnamese targeted aspect sentiment analysis for multiple textual domains","authors":"Khanh Quoc Tran, Quang Phan-Minh Huynh, Oanh Thi-Hong Le, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen","doi":"10.1016/j.csl.2025.101800","DOIUrl":"10.1016/j.csl.2025.101800","url":null,"abstract":"<div><div>Targeted Aspect Sentiment Analysis (TASA) has gained substantial attraction in recent years, fostering diverse studies and technological advancements. However, the development of TASA resources for Vietnamese has been limited. This paper introduces ViTASA, a comprehensive, high-quality dataset designed to catalyze advancements in Vietnamese TASA. ViTASA encompasses over 500,000 target-aspect pairs from social media comments across three key domains: mobile, restaurant, and hotel, thereby addressing critical gaps in existing datasets. Additionally, ViTASA integrates a novel multi-task evaluation framework, posing new challenges and enabling robust model assessments. We present ViTASD, an innovative BERT-based approach optimized for the linguistic features of Vietnamese. Comparative analyses demonstrate that ViTASD significantly outperforms existing state-of-the-art methods, including CG-BERT, QACG-BERT, BERT-pair-QA, BERT-pair-NLI, and a range of zero-shot learning models like Gemma, Llama, Mistral and Qwen. Notably, ViTASD achieves superior macro F1-scores of 61.77%, 41.12%, and 52.64% in the mobile, restaurant, and hotel domains respectively. This study not only highlights the challenges inherent in Vietnamese sentiment analysis but also lays a robust foundation for future research endeavors in this area. In a commitment to advancing TASA technology and enhancing the reliability of digital media analyses, we have made the ViTASA dataset, model checkpoints, and source code openly accessible on GitHub<span><span><sup>1</sup></span></span>.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"93 ","pages":"Article 101800"},"PeriodicalIF":3.1,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143734937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting spatial information and target speaker phoneme loss for multichannel directional speech enhancement and recognition","authors":"Cong Pang , Ye Ni , Lin Zhou , Li Zhao , Feifei Xiong","doi":"10.1016/j.csl.2025.101801","DOIUrl":"10.1016/j.csl.2025.101801","url":null,"abstract":"<div><div>Directional speech extraction catches increasing attention recently in multichannel speech separation, as it focuses solely on extracting the target speech to make real-time communication (RTC) and automatic speech recognition (ASR) more productive. This work investigates a real-time multichannel neural framework for directional speech enhancement and recognition by exploiting the explicit spatial information derived from the microphone array geometry, and the implicit spatial information learned from a dedicated narrow-band network. In addition to the traditional signal-based loss functions, we further introduce a loss inspired by the ASR phoneme mismatch to guide the framework training towards the distortion-less target speech signals. Experimental results with simulated datasets show that the proposed framework significantly improves the speech quality of the target speaker locating at the specific direction in noisy and reverberant environments with interfering speakers. The improved ASR results with the real-recorded dataset of live conversations from the CHiME8 MMCSG Challenge further verify the effectiveness of our system for practical applications.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"93 ","pages":"Article 101801"},"PeriodicalIF":3.1,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143738737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}