Computer Speech and Language最新文献

Time–Frequency Causal Hidden Markov Model for speech-based Alzheimer’s disease longitudinal detection 基于语音的阿尔茨海默病纵向检测的时频因果隐马尔可夫模型

IF 3.1 3区计算机科学

Computer Speech and Language Pub Date : 2025-07-19 DOI: 10.1016/j.csl.2025.101862

Yilin Pan , Jiabing Li , Yating Zhang , Zhuoran Tian , Yijia Zhang , Mingyu Lu

{"title":"Time–Frequency Causal Hidden Markov Model for speech-based Alzheimer’s disease longitudinal detection","authors":"Yilin Pan , Jiabing Li , Yating Zhang , Zhuoran Tian , Yijia Zhang , Mingyu Lu","doi":"10.1016/j.csl.2025.101862","DOIUrl":"10.1016/j.csl.2025.101862","url":null,"abstract":"<div><div>Speech deterioration is an early indicator in individuals with Alzheimer’s disease (AD), with progression influenced by various factors, leading to unique trajectories for each individual. To facilitate automated longitudinal detection of AD using speech, we propose an enhanced Hidden Markov Model (HMM), termed the Time-Frequency Causal HMM (TF-CHMM), which models disease-causative acoustic features over time under the Markov property. The TF-CHMM integrates a parallel convolutional neural network as an encoder for spectrograms, extracting both time-domain and frequency-domain features from audio recordings linked to AD. Additionally, it incorporates personal attributes (e.g., age) and clinical diagnosis data (e.g., MMSE scores) as supplementary inputs, disentangling disease-related features from unrelated components through a sequential variational auto-encoder with causal inference. The TF-CHMM is evaluated using the Pitt Corpus, which includes annual visits for each subject with a variable number of longitudinal samples, comprising audio recordings, manual transcriptions, MMSE scores, and age information. Experimental results demonstrated the effectiveness of our designed system, achieving a competitive accuracy of 90.24% and an F1 score of 90.00%. An ablation study further highlighted the efficiency of the parallel convolutional kernels in extracting time–frequency information and emphasized the effectiveness of our longitudinal experimental setup in the AD detection system.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101862"},"PeriodicalIF":3.1,"publicationDate":"2025-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144687179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Linguistically informed automatic speech recognition in Sanskrit 基于语言学的梵语自动语音识别

IF 3.1 3区计算机科学

Computer Speech and Language Pub Date : 2025-07-12 DOI: 10.1016/j.csl.2025.101861

Rishabh Kumar , Devaraja Adiga , Rishav Ranjan , Amrith Krishna , Ganesh Ramakrishnan , Pawan Goyal , Preethi Jyothi

{"title":"Linguistically informed automatic speech recognition in Sanskrit","authors":"Rishabh Kumar , Devaraja Adiga , Rishav Ranjan , Amrith Krishna , Ganesh Ramakrishnan , Pawan Goyal , Preethi Jyothi","doi":"10.1016/j.csl.2025.101861","DOIUrl":"10.1016/j.csl.2025.101861","url":null,"abstract":"<div><div>The field of Automatic Speech Recognition (ASR) for Sanskrit is marked by distinctive challenges, primarily due to the language’s intricate linguistic and morphological characteristics. Recognizing the burgeoning interest in this domain, we present the ‘Vāksañcayah’ speech corpus, a comprehensive collection that captures the linguistic depth and complexities of Sanskrit. Building upon our prior work, which focused on various acoustic model (AM) and language model (LM) units, we present an enhanced ASR system. This system integrates innovative subword tokenization methods and enriches the search space with linguistic insights. Addressing the issue of high out-of-vocabulary (OOV) rates and the prevalence of infrequently used words in Sanskrit, we employed a subword-based language model. Our approach mitigates these challenges and facilitates the generation of a subword-based search space. While effective in numerous scenarios, this model encounters limitations regarding long-range dependencies and semantic context comprehension. To counter these limitations, we leveraged Sanskrit’s rich morphological framework, thus achieving a more holistic understanding. The subword-based search space is subsequently transformed into a word-based format and augmented with morphological and lexical data, derived from a lexically driven shallow parser. Enhancing this further, we rescore transitions within this enriched space using a supervised morphological parser specifically designed for Sanskrit. Our proposed methodology is currently acclaimed as the most advanced in the realm of Sanskrit ASR, achieving a Word Error Rate (WER) of 12.54 and an improvement of 3.77 absolute points over the previous best. Additionally, we annotated 500 utterances with detailed morphological data and their corresponding lemmas, providing a basis for extensive linguistic analysis.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101861"},"PeriodicalIF":3.1,"publicationDate":"2025-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Toward fast meeting transcription: NAIST system for CHiME-8 NOTSOFAR-1 task and its analysis 面向快速会议转录：CHiME-8 NOTSOFAR-1任务的NAIST系统及其分析

IF 3.1 3区计算机科学

Computer Speech and Language Pub Date : 2025-07-08 DOI: 10.1016/j.csl.2025.101836

Yuta Hirano , Mau Nguyen , Kakeru Azuma , Jan Meyer Saragih , Sakriani Sakti

{"title":"Toward fast meeting transcription: NAIST system for CHiME-8 NOTSOFAR-1 task and its analysis","authors":"Yuta Hirano , Mau Nguyen , Kakeru Azuma , Jan Meyer Saragih , Sakriani Sakti","doi":"10.1016/j.csl.2025.101836","DOIUrl":"10.1016/j.csl.2025.101836","url":null,"abstract":"<div><div>This paper reports on the NAIST system submitted to the CHIME-8 challenge’s NOTSOFAR-1 (Natural Office Talkers in Settings of Far-field Audio Recordings) task, including results and analyses from several additional experiments. While fast processing is crucial for real-world applications, the CHIME-7 challenge focused solely on reducing error rate, neglecting the practical aspects of system performance such as inference speed. Therefore, this research aims to develop a practical system by improving recognition accuracy while simultaneously reducing inference speed. To address this challenge, we propose enhancing the baseline module architecture by modifying both the CSS and ASR modules. Specifically, the ASR module was built based on a WavLM large feature extractor and a Zipformer transducer. Furthermore, we employed reverberation removal using block-wise weighted prediction error (WPE) as preprocessing for the speech separation module. The proposed system achieved a relative reduction in tcpWER of 11.6% for single-channel tracks and 18.7% for multi-channel tracks compared to the baseline system. Moreover, the proposed system operates up to six times faster than the baseline system while achieving superior tcpWER results. We also report on the observed changes in system performance due to variations in the amount of training data for the ASR model, as well the impact of the maximum word-length setting in the transducer-based ASR module on the subsequent diarization system, based on findings from our system development.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101836"},"PeriodicalIF":3.1,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144633205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Gnowsis: Multimodal multitask learning for oral proficiency assessments 多模态多任务学习用于口语水平评估

IF 3.1 3区计算机科学

Computer Speech and Language Pub Date : 2025-07-05 DOI: 10.1016/j.csl.2025.101860

Hiroaki Takatsu , Shungo Suzuki , Masaki Eguchi , Ryuki Matsuura , Mao Saeki , Yoichi Matsuyama

{"title":"Gnowsis: Multimodal multitask learning for oral proficiency assessments","authors":"Hiroaki Takatsu , Shungo Suzuki , Masaki Eguchi , Ryuki Matsuura , Mao Saeki , Yoichi Matsuyama","doi":"10.1016/j.csl.2025.101860","DOIUrl":"10.1016/j.csl.2025.101860","url":null,"abstract":"<div><div>Although oral proficiency assessments are crucial to understand second language (L2) learners’ progress, they are resource-intensive. Herein we propose a multimodal multitask learning model to assess L2 proficiency levels from multiple aspects on the basis of multimodal dialogue data. To construct the model, we first created a dataset of speech samples collected through oral proficiency interviews between Japanese learners of English and a conversational virtual agent. Expert human raters subsequently categorized the samples into the six levels based on the rating scales defined in the Common European Framework of Reference for Languages with respect to proficiency in one holistic and five analytic assessment criteria (vocabulary richness, grammatical accuracy, fluency, goodness of pronunciation, and coherence). The model was trained using this dataset via the multitask learning approach to simultaneously predict the proficiency levels of these language competences from various linguistic features. These features were extracted via multiple encoder modules, which were composed of feature extractors pretrained through various natural language processing tasks such as grammatical error correction, coreference resolution, discourse marker prediction, and pronunciation scoring. In experiments comparing the proposed model to baseline models with a feature extractor pretrained with single modality (textual or acoustic) features, the proposed model outperformed the baseline models. In particular, the proposed model was robust even with limited training data or short dialogues with a smaller number of topics because it considered rich features.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101860"},"PeriodicalIF":3.1,"publicationDate":"2025-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144588195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Causal analysis of ASR errors for children: Quantifying the impact of physiological, cognitive, and extrinsic factors 儿童ASR错误的原因分析：量化生理、认知和外在因素的影响

IF 3.1 3区计算机科学

Computer Speech and Language Pub Date : 2025-07-01 DOI: 10.1016/j.csl.2025.101859

Vishwanath Pratap Singh , Md. Sahidullah , Tomi H. Kinnunen

{"title":"Causal analysis of ASR errors for children: Quantifying the impact of physiological, cognitive, and extrinsic factors","authors":"Vishwanath Pratap Singh , Md. Sahidullah , Tomi H. Kinnunen","doi":"10.1016/j.csl.2025.101859","DOIUrl":"10.1016/j.csl.2025.101859","url":null,"abstract":"<div><div>The increasing use of children’s automatic speech recognition (ASR) systems has spurred research efforts to improve the accuracy of models designed for children’s speech in recent years. The current approach utilizes either open-source speech foundation models (SFMs) directly or fine-tuning them with children’s speech data. These SFMs, whether open-source or fine-tuned for children, often exhibit higher word error rates (WERs) compared to adult speech. However, there is a lack of systemic analysis of the cause of this degraded performance of SFMs. Understanding and addressing the reasons behind this performance disparity is crucial for improving the accuracy of SFMs for children’s speech. Our study addresses this gap by investigating the causes of accuracy degradation and the primary contributors to WER in children’s speech. In the first part of the study, we conduct a comprehensive benchmarking study on two self-supervised SFMs (<span>Wav2Vec2.0</span> and <span>Hubert</span>) and two weakly supervised SFMs (<span>Whisper</span> and <span>Massively Multilingual Speech (MMS)</span>) across various age groups on two children speech corpora, establishing the raw data for the causal inference analysis in the second part. In the second part of the study, we analyze the impact of physiological factors (age, gender), cognitive factors (pronunciation ability), and external factors (vocabulary difficulty, background noise, and word count) on SFM accuracy in children’s speech using causal inference. The results indicate that physiology (age) and particular external factor (number of words in audio) have the highest impact on accuracy, followed by background noise and pronunciation ability. Fine-tuning SFMs on children’s speech reduces sensitivity to physiological and cognitive factors, while sensitivity to the number of words in audio persists.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101859"},"PeriodicalIF":3.1,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144588194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Performance assessment of voice conversion models using speech production-based parameters 使用基于语音生产参数的语音转换模型的性能评估

IF 3.1 3区计算机科学

Computer Speech and Language Pub Date : 2025-06-28 DOI: 10.1016/j.csl.2025.101853

Ashwini Dasare, K.T. Deepak

{"title":"Performance assessment of voice conversion models using speech production-based parameters","authors":"Ashwini Dasare, K.T. Deepak","doi":"10.1016/j.csl.2025.101853","DOIUrl":"10.1016/j.csl.2025.101853","url":null,"abstract":"<div><div>Voice Conversion (VC) transforms a source voice to sound like a target voice. However, the field requires more standardized objective metrics to evaluate its performance thoroughly. Traditional evaluation methods, such as Mel-Cepstral Distortion (MCD), F0-Root Mean Squared Error (F0RMSE), and Modulated-Spectral Distance (MSD), primarily focus on perceptual features and often overlook speech production attributes. This can result in a mismatch between perceived voice similarity and the physiological aspects of the voice, leading to a reliance on subjective methods like the Mean Opinion Score (MOS). While MOS provides valuable insights, it is resource-intensive and inherently subjective, limiting its practicality for widespread use. This research proposes an objective framework for evaluating voice quality in VC tasks by focusing on key speech production parameters, including jitter, shimmer, harmonics-to-noise ratio, and vocal tract length. Our findings suggest that these parameters, which encapsulate the distinct characteristics of a speaker’s voice, provide a more precise basis for assessing perceptual similarity between converted and target voices. Compared to traditional objective metrics like MCD, MSD, F0RMSE, and also non-intrusive measures like MOSNET, UTMOS, our proposed method consistently shows a correlation with MOS, suggesting that it better aligns with subjective evaluations of voice quality. This presents a more reliable and practical alternative to conventional methods that primarily emphasize perceptual features. This study evaluates how well different VC models, such as StarGANv2-VC, Retrival-based VC, Suno-Bark, and Diff-VC replicate speech production parameters across various languages and accents, including English, Kannada, Hindi, and the low-resource Soliga language. The results provide insights into improving the evaluation of voice conversion technologies by focusing on speech production attributes, helping to bridge the gap between perceptual similarity and physiological accuracy. The proposed work lays the groundwork for developing standardized, objective evaluation methods for VC models based on speech production characteristics.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101853"},"PeriodicalIF":3.1,"publicationDate":"2025-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Mono- and cross-lingual evaluation of representation language models on less-resourced languages 在资源较少的语言上对表示语言模型的单语言和跨语言评价

IF 3.1 3区计算机科学

Computer Speech and Language Pub Date : 2025-06-27 DOI: 10.1016/j.csl.2025.101852

Matej Ulčar , Aleš Žagar , Carlos S. Armendariz , Andraž Repar , Senja Pollak , Matthew Purver , Marko Robnik-Šikonja

引用次数: 0

Automatic speech recognition in the presence of babble noise and reverberation compared to human speech intelligibility in Spanish 在西班牙语中，与人类语音可理解性相比，在咿呀学语噪声和混响存在下的自动语音识别

IF 3.1 3区计算机科学

Computer Speech and Language Pub Date : 2025-06-24 DOI: 10.1016/j.csl.2025.101856

Carlos Mena , A.L. Padilla-Ortiz , Felipe Orduña-Bustamante

{"title":"Automatic speech recognition in the presence of babble noise and reverberation compared to human speech intelligibility in Spanish","authors":"Carlos Mena , A.L. Padilla-Ortiz , Felipe Orduña-Bustamante","doi":"10.1016/j.csl.2025.101856","DOIUrl":"10.1016/j.csl.2025.101856","url":null,"abstract":"<div><div>The performance of three representative automatic speech recognition (ASR) systems: NeMo, Wav2Vec, and Whisper, was evaluated for the Spanish language as spoken in the central region of Mexico, in the presence of speech babble noise as a function of signal to noise ratio (SNR) and also separately under different reverberant conditions. NeMo and Wav2Vec were pretrained, or specially fine-tuned for the recognition of Mexican Spanish, as required by the language architectures of these ASR systems, while Whisper was used without requiring such fine-tuning. Speech intelligibility tests with human participants were also carried out on the same speech material and under the same acoustic conditions: noise and reverberation. Character error rate and word error rate metrics were mapped into speech intelligibility scores, speech reception thresholds, and intelligibility slopes, the latter being performance metrics more commonly used in the evaluation of human speech intelligibility. ASR results show profiles of performance vs. SNR which are akin to those found for human listeners. Comparison with speech intelligibility results by human listeners, show speech reception thresholds (signal to noise dB levels corresponding to 50% intelligibility in the presence of acoustic noise) which are higher, showing lower performance relative to humans, by 1.8 dB for Whisper, 3.0 dB for Wav2Vec, 7.0 dB for NeMo. Intelligibility slopes (indicating rate of performance recovery with increasing SNR) were higher for Whisper (13.6%/dB) and Wav2Vec (12.0%/dB), lower for NeMo (5.0%/dB), relative to an intermediate value for humans (9.3%/dB). Performance with reverberated speech indicate reverberation time thresholds (for 50% intelligibility) of 3.1 s for Whisper, 2.6 s for humans, 1.4 s for Wav2Vec, and 1.0 s for NeMo. Whisper is seen to outperform Wav2Vec and NeMo in all aspects, while also outperforming humans in terms of speech intelligibility slope and reverberation threshold, except for speech reception threshold in noise. These results provide performance metrics for the ASR systems included in this study in the context of human speech intelligibility. Also, in view of their good performance, Whisper and Wav2Vec lend themselves to be used in predicting human speech intelligibility in different scenarios by conducting equivalent evaluations through automatic speech recognition.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101856"},"PeriodicalIF":3.1,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144502112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Privacy-preserving feature extractor using adversarial pruning for TBI assessment from speech 基于对抗性修剪的语音TBI评估隐私保护特征提取器

IF 3.1 3区计算机科学

Computer Speech and Language Pub Date : 2025-06-23 DOI: 10.1016/j.csl.2025.101854

Apiwat Ditthapron , Emmanuel O. Agu , Adam C. Lammert

{"title":"Privacy-preserving feature extractor using adversarial pruning for TBI assessment from speech","authors":"Apiwat Ditthapron , Emmanuel O. Agu , Adam C. Lammert","doi":"10.1016/j.csl.2025.101854","DOIUrl":"10.1016/j.csl.2025.101854","url":null,"abstract":"<div><div>Speech is an effective indicator of medical conditions such as Traumatic Brain Injury (TBI), but frequently includes private information, preventing novel passive, real-world assessments using the patient’s smartphone. Privacy research for speech processing has primarily focused on hiding the speaker’s identity, which is utilized in authentication systems and cannot be renewed. Our study extends privacy to include the content of speech, specifically sensitive words during conversation. Prior work has proposed extracting privacy-preserving features via adversarial training, which trains a neural network to defend against attacks on private data that an adversarial network is simultaneously attempting to access. However, adversarial training has an unsolved problem of training instability due to the inherent limitations of minimax optimization. Instead, our study introduces Privacy-Preserving using Adversarial Pruning (PPA-Pruning). Nodes are systematically removed from the network while prioritizing those contributing most to the recognition of personal data from a well-trained feature extractor designed for TBI detection and adversarial tasks. PPA-Pruning was evaluated for various privacy budgets via a differential privacy setup. Notably, PPA-Pruning outperforms baseline methods, including adversarial training and Laplace noise, achieving up to an 11% improvement in TBI detection accuracy at the same privacy level.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101854"},"PeriodicalIF":3.1,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144491288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Sentiment classification method based on BERT-CondConv multi-moment state fusion 基于BERT-CondConv多时刻状态融合的情感分类方法

IF 3.1 3区计算机科学

Computer Speech and Language Pub Date : 2025-06-20 DOI: 10.1016/j.csl.2025.101855

Wang Xiaoyang , Liu Wenfeng

引用次数: 0