Computer Speech and Language: Latest Articles

Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language · Pub Date: 2025-05-05 · DOI: 10.1016/j.csl.2025.101807
Bang Zeng, Ming Li
{"title":"Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection","authors":"Bang Zeng,&nbsp;Ming Li","doi":"10.1016/j.csl.2025.101807","DOIUrl":"10.1016/j.csl.2025.101807","url":null,"abstract":"<div><div>Determining “who spoke what and when” remains challenging in real-world applications. In typical scenarios, Speaker Diarization (SD) is employed to address the problem of “who spoke when”, while Target Speaker Extraction (TSE) or Target Speaker Automatic Speech Recognition (TSASR) techniques are utilized to resolve the issue of “who spoke what”. Although some works have achieved promising results by combining SD and TSE systems, inconsistencies remain between SD and TSE regarding both output inconsistency and scenario mismatch. To address these limitations, we propose a Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection (USEF-TP) model that jointly performs TSE and Personal Voice Activity Detection (PVAD). USEF-TP leverages frame-level features obtained through a cross-attention mechanism as speaker-related features instead of using speaker embeddings as in traditional approaches. Additionally, a multi-task learning algorithm with a scenario-aware differentiated loss function is applied to ensure robust performance across various levels of speaker overlap. The experimental results show that our proposed USEF-TP model achieves superior performance in TSE and PVAD tasks on the LibriMix and SparseLibriMix datasets. The results on the CALLHOME dataset demonstrate the competitive performance of our model on real recordings.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101807"},"PeriodicalIF":3.1,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143918432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Tailored design of Audio–Visual Speech Recognition models using Branchformers
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language · Pub Date: 2025-05-05 · DOI: 10.1016/j.csl.2025.101811
David Gimeno-Gómez, Carlos D. Martínez-Hinarejos
{"title":"Tailored design of Audio–Visual Speech Recognition models using Branchformers","authors":"David Gimeno-Gómez,&nbsp;Carlos D. Martínez-Hinarejos","doi":"10.1016/j.csl.2025.101811","DOIUrl":"10.1016/j.csl.2025.101811","url":null,"abstract":"<div><div>Recent advances in Audio–Visual Speech Recognition (AVSR) have led to unprecedented achievements in the field, improving the robustness of this type of system in adverse, noisy environments. In most cases, this task has been addressed through the design of models composed of two independent encoders, each dedicated to a specific modality. However, while recent works have explored unified audio–visual encoders, determining the optimal cross-modal architecture remains an ongoing challenge. Furthermore, such approaches often rely on models comprising vast amounts of parameters and high computational cost training processes. In this paper, we aim to bridge this research gap by introducing a novel audio–visual framework. Our proposed method constitutes, to the best of our knowledge, the first attempt to harness the flexibility and interpretability offered by encoder architectures, such as the Branchformer, in the design of parameter-efficient AVSR systems. To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio–visual unified encoder based on the layer-level branch scores provided by the modality-specific models. Extensive experiments on English and Spanish AVSR benchmarks covering multiple data conditions and scenarios demonstrated the effectiveness of our proposed method. Even when trained on a moderate scale of data, our models achieve competitive word error rates (WER) of approximately 2.5% for English and surpass existing approaches for Spanish, establishing a new benchmark with an average WER of around 9.1%. These results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates while significantly reducing the model complexity w.r.t. the prevalent approach in the field. Code and pre-trained models are available at <span><span>https://github.com/david-gimeno/tailored-avsr</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101811"},"PeriodicalIF":3.1,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143918433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Adapting general disentanglement-based speaker anonymization for enhanced emotion preservation
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language · Pub Date: 2025-05-04 · DOI: 10.1016/j.csl.2025.101810
Xiaoxiao Miao , Yuxiang Zhang , Xin Wang , Natalia Tomashenko , Donny Cheng Lock Soh , Ian Mcloughlin
{"title":"Adapting general disentanglement-based speaker anonymization for enhanced emotion preservation","authors":"Xiaoxiao Miao ,&nbsp;Yuxiang Zhang ,&nbsp;Xin Wang ,&nbsp;Natalia Tomashenko ,&nbsp;Donny Cheng Lock Soh ,&nbsp;Ian Mcloughlin","doi":"10.1016/j.csl.2025.101810","DOIUrl":"10.1016/j.csl.2025.101810","url":null,"abstract":"<div><div>A general disentanglement-based speaker anonymization system typically separates speech into content, speaker, and prosody features using individual encoders. This paper explores how to adapt such a system when a new speech attribute, for example, emotion, needs to be preserved to a greater extent. Two strategies for this are examined. First, we show that integrating emotion embeddings from a pre-trained emotion encoder can help preserve emotional cues, even though this approach slightly compromises privacy protection. Alternatively, we propose an emotion compensation strategy as a post-processing step applied to anonymized speaker embeddings. This conceals the original speaker’s identity and reintroduces the emotional traits lost during speaker embedding anonymization. Specifically, we model the emotion attribute using support vector machines to learn separate boundaries for each emotion. During inference, the original speaker embedding is processed in two ways: one, by an emotion indicator to predict emotion and select the emotion-matched SVM accurately; and two, by a speaker anonymizer to conceal speaker characteristics. The anonymized speaker embedding is then modified along the corresponding SVM boundary towards an enhanced emotional direction to save the emotional cues. The proposed strategies are also expected to be useful for adapting a general disentanglement-based speaker anonymization system to preserve other target paralinguistic attributes, with potential for a range of downstream tasks.<span><span><sup>2</sup></span></span></div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101810"},"PeriodicalIF":3.1,"publicationDate":"2025-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143906655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Quantifying prediction uncertainties in automatic speaker verification systems
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language · Pub Date: 2025-04-30 · DOI: 10.1016/j.csl.2025.101806
Miao Jing , Vidhyasaharan Sethu , Beena Ahmed , Kong Aik Lee
{"title":"Quantifying prediction uncertainties in automatic speaker verification systems","authors":"Miao Jing ,&nbsp;Vidhyasaharan Sethu ,&nbsp;Beena Ahmed ,&nbsp;Kong Aik Lee","doi":"10.1016/j.csl.2025.101806","DOIUrl":"10.1016/j.csl.2025.101806","url":null,"abstract":"<div><div>For modern automatic speaker verification (ASV) systems, explicitly quantifying the confidence for each prediction strengthens the system’s reliability by indicating in which case the system is with trust. However, current paradigms do not take this into consideration. We thus propose to express confidence in the prediction by quantifying the uncertainty in ASV predictions. This is achieved by developing a novel Bayesian framework to obtain a score distribution for each input. The mean of the distribution is used to derive the decision while the spread of the distribution represents the uncertainty arising from the plausible choices of the model parameters. To capture the plausible choices, we sample the probabilistic linear discriminant analysis (PLDA) back-end model posterior through Hamiltonian Monte-Carlo (HMC) and approximate the embedding model posterior through stochastic Langevin dynamics (SGLD) and Bayes-by-backprop. Given the resulting score distribution, a further quantification and decomposition of the prediction uncertainty are achieved by calculating the score variance, entropy, and mutual information. The quantified uncertainties include the aleatoric uncertainty and epistemic uncertainty (model uncertainty). We evaluate them by observing how they change while varying the amount of training speech, the duration, and the noise level of testing speech. The experiments indicate that the behaviour of those quantified uncertainties reflects the changes we made to the training and testing data, demonstrating the validity of the proposed method as a measure of uncertainty.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101806"},"PeriodicalIF":3.1,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143903647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SpeechColab leaderboard: An open-source platform for automatic speech recognition evaluation
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language · Pub Date: 2025-04-25 · DOI: 10.1016/j.csl.2025.101805
Jiayu Du , Jinpeng Li , Guoguo Chen , Wei-Qiang Zhang
{"title":"SpeechColab leaderboard: An open-source platform for automatic speech recognition evaluation","authors":"Jiayu Du ,&nbsp;Jinpeng Li ,&nbsp;Guoguo Chen ,&nbsp;Wei-Qiang Zhang","doi":"10.1016/j.csl.2025.101805","DOIUrl":"10.1016/j.csl.2025.101805","url":null,"abstract":"<div><div>In the wake of the surging tide of deep learning over the past decade, Automatic Speech Recognition (ASR) has garnered substantial attention, leading to the emergence of numerous publicly accessible ASR systems that are actively being integrated into our daily lives. Nonetheless, impartial and replicable evaluations of these ASR systems encounter challenges due to various subtleties. In this paper we introduce the SpeechColab Leaderboard, a general-purpose, open-source platform designed for ASR evaluation. With this platform: (i) We report a comprehensive benchmark, unveiling the current state-of-the-art panorama for ASR systems, covering both open-source models and industrial commercial services. (ii) We quantize how distinct nuances in the scoring pipeline influence the final benchmark outcomes, including capitalization, punctuation, interjection, contraction, synonym usage, compound words, etc. These issues have gained prominence in the context of the transition towards End-to-End ASR systems. (iii) We propose and discuss a modification to the conventional Token-Error-Rate (TER) metric, called modified-TER (mTER), inspired from Kolmogorov Complexity and Normalized Information Distance (NID). The proposed metric becomes normalized and symmetrical (with regard to reference and hypothesis). A large-scale empirical study is then presented comparing TER and mTER. The SpeechColab Leaderboard is accessible at <span><span>https://github.com/SpeechColab/Leaderboard</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101805"},"PeriodicalIF":3.1,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143881968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Leveraging distance information for generalized spoofing speech detection
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language · Pub Date: 2025-04-20 · DOI: 10.1016/j.csl.2025.101804
Jingze Lu , Yuxiang Zhang , Zhuo Li , Zengqiang Shang , Wenchao Wang , Pengyuan Zhang
{"title":"Leveraging distance information for generalized spoofing speech detection","authors":"Jingze Lu ,&nbsp;Yuxiang Zhang ,&nbsp;Zhuo Li ,&nbsp;Zengqiang Shang ,&nbsp;Wenchao Wang ,&nbsp;Pengyuan Zhang","doi":"10.1016/j.csl.2025.101804","DOIUrl":"10.1016/j.csl.2025.101804","url":null,"abstract":"<div><div>Spoofing speech detection (SSD) systems are confronted with insufficient generalization ability for in-the-wild data, including unseen attacks and bonafide speech from unseen distributions, which hampers their applicability in real-world scenarios. Such performance degradation could be attributed to the inherent flaw of deep neural network (DNN)-based models, that is, overlearning the training data. Inter-instance distance, which is underutilized in conventional DNN-based classifiers, proves beneficial in handling unseen samples. Our experiments indicate that the distances between bonafide speech are closer than spoofing one in certain feature spaces. Therefore, this paper proposes a distance-based method to enhance anti-spoofing models’ generalization ability. By incorporating distance features as a prefix, the proposed method achieves lightweight parameter updates while effectively detecting unseen attacks and bonafide utterances from unseen distributions. On the logical access of ASVspoof 2019 and ASVspoof 2021, the proposed method achieves 0.53% and 4.73% equal error rates (EERs). Moreover, it achieves 1.86% and 7.30% EERs on the ASVspoof 2021 Deepfake and IntheWild datasets, respectively, demonstrating its superior generalization ability. The proposed method outperforms other state-of-the-art (SOTA) methods on multiple datasets.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101804"},"PeriodicalIF":3.1,"publicationDate":"2025-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143918431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Vishing: Detecting social engineering in spoken communication — A first survey & urgent roadmap to address an emerging societal challenge
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language · Pub Date: 2025-04-15 · DOI: 10.1016/j.csl.2025.101802
Andreas Triantafyllopoulos , Anika A. Spiesberger , Iosif Tsangko , Xin Jing , Verena Distler , Felix Dietz , Florian Alt , Björn W. Schuller
{"title":"Vishing: Detecting social engineering in spoken communication — A first survey & urgent roadmap to address an emerging societal challenge","authors":"Andreas Triantafyllopoulos ,&nbsp;Anika A. Spiesberger ,&nbsp;Iosif Tsangko ,&nbsp;Xin Jing ,&nbsp;Verena Distler ,&nbsp;Felix Dietz ,&nbsp;Florian Alt ,&nbsp;Björn W. Schuller","doi":"10.1016/j.csl.2025.101802","DOIUrl":"10.1016/j.csl.2025.101802","url":null,"abstract":"<div><div>Vishing – the use of voice calls for phishing – is a form of Social Engineering (SE) attacks. The latter have become a pervasive challenge in modern societies, with over 300,000 yearly victims in the US alone. An increasing number of those attacks is conducted via voice communication, be it through machine-generated ‘robocalls’ or human actors. The goals of ‘social engineers’ can be manifold, from outright fraud to more subtle forms of persuasion. Accordingly, social engineers adopt multi-faceted strategies for voice-based attacks, utilising a variety of ‘tricks’ to exert influence and achieve their goals. Importantly, while organisations have set in place a series of guardrails against other types of SE attacks, voice calls still remain ‘open ground’ for potential bad actors. In the present contribution, we provide an overview of the existing speech technology subfields that need to coalesce into a protective net against one of the major challenges to societies worldwide. Given the dearth of speech science and technology works targeting this issue, we have opted for a narrative review that bridges the gap between the existing psychological literature on the topic and research that has been pursued in parallel by the speech community on some of the constituent constructs. Our review reveals that very little literature exists on addressing this very important topic from a speech technology perspective, an omission further exacerbated by the lack of available data. Thus, our main goal is to highlight this gap and sketch out a roadmap to mitigate it, beginning with the psychological underpinnings of vishing, which primarily include deception and persuasion strategies, continuing with the speech-based approaches that can be used to detect those, as well as the generation and detection of AI-based vishing attempts, and close with a discussion of ethical and legal considerations.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101802"},"PeriodicalIF":3.1,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143839274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Influence of the surprisal power adjustment on spoken word duration in emotional speech in Serbian
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language · Pub Date: 2025-04-11 · DOI: 10.1016/j.csl.2025.101803
Jelena Lazić, Sanja Vujnović
{"title":"Influence of the surprisal power adjustment on spoken word duration in emotional speech in Serbian","authors":"Jelena Lazić,&nbsp;Sanja Vujnović","doi":"10.1016/j.csl.2025.101803","DOIUrl":"10.1016/j.csl.2025.101803","url":null,"abstract":"<div><div>Emotional speech analysis has been a topic of interest across multiple disciplines. However, it remains a challenging task due to its complexity and multimodality. Computer systems still struggle with robustness when dealing with emotional speech. Despite being a difficult area of research, the wide range of potential applications, especially nowadays in the era of intelligent agents and humanoid systems, has led to increased interest in this field. With the development of machine learning models, a variety of novel techniques have emerged, including pre-trained language models. In this work, we used these models to research emotional speech analysis from an information-theory perspective. Specifically, we focused on analyzing language processing difficulty, measured by word-level spoken time duration, and its relation to information distribution over speech, measured by word-level surprisal values. We analyzed a dataset of audio recordings in the low-resourced Serbian language, recorded under five different speakers’ emotional states. Seven state-of-the-art machine learning language models were employed to estimate surprisal values, which were then used as predictive parameters for word-level spoken time duration. Our results supported related studies in the English language and indicated that machine learning-estimated surprisal values may be good predictors of speech parameters in Serbian. Furthermore, modulating the power of surprisal values led to different outcomes for various speakers’ emotional states. This suggests potential differences in the role of surprisal values in speech production under different emotional conditions.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"94 ","pages":"Article 101803"},"PeriodicalIF":3.1,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143850116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ViTASA: New benchmark and methods for Vietnamese targeted aspect sentiment analysis for multiple textual domains
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language · Pub Date: 2025-03-27 · DOI: 10.1016/j.csl.2025.101800
Khanh Quoc Tran, Quang Phan-Minh Huynh, Oanh Thi-Hong Le, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
{"title":"ViTASA: New benchmark and methods for Vietnamese targeted aspect sentiment analysis for multiple textual domains","authors":"Khanh Quoc Tran,&nbsp;Quang Phan-Minh Huynh,&nbsp;Oanh Thi-Hong Le,&nbsp;Kiet Van Nguyen,&nbsp;Ngan Luu-Thuy Nguyen","doi":"10.1016/j.csl.2025.101800","DOIUrl":"10.1016/j.csl.2025.101800","url":null,"abstract":"<div><div>Targeted Aspect Sentiment Analysis (TASA) has gained substantial attraction in recent years, fostering diverse studies and technological advancements. However, the development of TASA resources for Vietnamese has been limited. This paper introduces ViTASA, a comprehensive, high-quality dataset designed to catalyze advancements in Vietnamese TASA. ViTASA encompasses over 500,000 target-aspect pairs from social media comments across three key domains: mobile, restaurant, and hotel, thereby addressing critical gaps in existing datasets. Additionally, ViTASA integrates a novel multi-task evaluation framework, posing new challenges and enabling robust model assessments. We present ViTASD, an innovative BERT-based approach optimized for the linguistic features of Vietnamese. Comparative analyses demonstrate that ViTASD significantly outperforms existing state-of-the-art methods, including CG-BERT, QACG-BERT, BERT-pair-QA, BERT-pair-NLI, and a range of zero-shot learning models like Gemma, Llama, Mistral and Qwen. Notably, ViTASD achieves superior macro F1-scores of 61.77%, 41.12%, and 52.64% in the mobile, restaurant, and hotel domains respectively. This study not only highlights the challenges inherent in Vietnamese sentiment analysis but also lays a robust foundation for future research endeavors in this area. In a commitment to advancing TASA technology and enhancing the reliability of digital media analyses, we have made the ViTASA dataset, model checkpoints, and source code openly accessible on GitHub<span><span><sup>1</sup></span></span>.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"93 ","pages":"Article 101800"},"PeriodicalIF":3.1,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143734937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Exploiting spatial information and target speaker phoneme loss for multichannel directional speech enhancement and recognition
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language · Pub Date: 2025-03-26 · DOI: 10.1016/j.csl.2025.101801
Cong Pang , Ye Ni , Lin Zhou , Li Zhao , Feifei Xiong
{"title":"Exploiting spatial information and target speaker phoneme loss for multichannel directional speech enhancement and recognition","authors":"Cong Pang ,&nbsp;Ye Ni ,&nbsp;Lin Zhou ,&nbsp;Li Zhao ,&nbsp;Feifei Xiong","doi":"10.1016/j.csl.2025.101801","DOIUrl":"10.1016/j.csl.2025.101801","url":null,"abstract":"<div><div>Directional speech extraction catches increasing attention recently in multichannel speech separation, as it focuses solely on extracting the target speech to make real-time communication (RTC) and automatic speech recognition (ASR) more productive. This work investigates a real-time multichannel neural framework for directional speech enhancement and recognition by exploiting the explicit spatial information derived from the microphone array geometry, and the implicit spatial information learned from a dedicated narrow-band network. In addition to the traditional signal-based loss functions, we further introduce a loss inspired by the ASR phoneme mismatch to guide the framework training towards the distortion-less target speech signals. Experimental results with simulated datasets show that the proposed framework significantly improves the speech quality of the target speaker locating at the specific direction in noisy and reverberant environments with interfering speakers. The improved ASR results with the real-recorded dataset of live conversations from the CHiME8 MMCSG Challenge further verify the effectiveness of our system for practical applications.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"93 ","pages":"Article 101801"},"PeriodicalIF":3.1,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143738737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0