{"title":"Gnowsis: Multimodal multitask learning for oral proficiency assessments","authors":"Hiroaki Takatsu , Shungo Suzuki , Masaki Eguchi , Ryuki Matsuura , Mao Saeki , Yoichi Matsuyama","doi":"10.1016/j.csl.2025.101860","DOIUrl":"10.1016/j.csl.2025.101860","url":null,"abstract":"<div><div>Although oral proficiency assessments are crucial to understand second language (L2) learners’ progress, they are resource-intensive. Herein we propose a multimodal multitask learning model to assess L2 proficiency levels from multiple aspects on the basis of multimodal dialogue data. To construct the model, we first created a dataset of speech samples collected through oral proficiency interviews between Japanese learners of English and a conversational virtual agent. Expert human raters subsequently categorized the samples into the six levels based on the rating scales defined in the Common European Framework of Reference for Languages with respect to proficiency in one holistic and five analytic assessment criteria (vocabulary richness, grammatical accuracy, fluency, goodness of pronunciation, and coherence). The model was trained using this dataset via the multitask learning approach to simultaneously predict the proficiency levels of these language competences from various linguistic features. These features were extracted via multiple encoder modules, which were composed of feature extractors pretrained through various natural language processing tasks such as grammatical error correction, coreference resolution, discourse marker prediction, and pronunciation scoring. In experiments comparing the proposed model to baseline models with a feature extractor pretrained with single modality (textual or acoustic) features, the proposed model outperformed the baseline models. In particular, the proposed model was robust even with limited training data or short dialogues with a smaller number of topics because it considered rich features.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101860"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144588195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring knowledge distillation for low-resource multi-modal streaming ASR in the CHiME-8 MMCSG challenge","authors":"Hongbo Lan, Ya Jiang, Jun Du, Qing Wang","doi":"10.1016/j.csl.2025.101837","DOIUrl":"10.1016/j.csl.2025.101837","url":null,"abstract":"<div><div>In the CHiME-8 Multi-modal Conversational Speech Recognition for Smart Glasses (MMCSG) challenge, participants were tasked with achieving real-time transcription of two-person conversations recorded with smart glasses. To address the scarcity of real-world data, we propose a knowledge distillation framework where a non-streaming teacher model, trained on augmented multi-channel audio, guides a streaming student model. Leveraging simulated data with varying overlap rates, the framework employs a logit-based Kullback–Leibler divergence loss alongside mean square error losses on hidden states and attention maps of Fast-Conformer layers to transfer knowledge from the teacher to the student, significantly improving the performance of the audio-only streaming automatic speech recognition (ASR) model. Furthermore, we exploit the synergy and complementarity of inertial measurement unit and audio data by developing a novel multi-modal streaming ASR model. Meanwhile, cross-modal distillation is performed by adopting the non-streaming audio-only teacher to guide the streaming multi-modal student. Experimental results demonstrate that our proposed multi-modal fusion and teacher-student learning framework effectively enhance the performance of streaming ASR models. Notably, our approach secured the first place in the sub-track of the CHiME-8 MMCSG challenge.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101837"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144240491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Three-stage modular speaker diarization collaborating with front-end techniques in the CHiME-8 NOTSOFAR-1 challenge","authors":"Ruo-Yu Wang , Jun Du , Shu-Tong Niu , Gao-Bin Yang , Tian Gao , Jia Pan , Qing-Feng Liu","doi":"10.1016/j.csl.2025.101863","DOIUrl":"10.1016/j.csl.2025.101863","url":null,"abstract":"<div><div>We propose a modular speaker diarization framework that collaborates with front-end techniques in a three-stage process, designed for the challenging CHiME-8 NOTSOFAR-1 acoustic environment. The framework leverages the strengths of deep learning based speech separation systems and traditional speech signal processing techniques to provide more accurate initializations for the Neural Speaker Diarization (NSD) system at each stage, thereby enhancing the performance of a single-channel NSD system. Firstly, speaker overlap detection and Continuous Speech Separation (CSS) are applied to the multichannel speech to obtain clearer single-speaker speech segments for the Clustering-based Speaker Diarization (CSD), followed by the first NSD decoding. Next, the binary speaker masks from the first decoding are used to initialize a complex Angular Center Gaussian Mixture Model (cACGMM) to estimate speaker masks on the multi-channel speech. Using Mask-to-VAD post-processing techniques, we achieve per-speaker speech activity with reduced speaker error (SpkErr), followed by a second NSD decoding. Finally, the second decoding results are used to Guide Source Separation (GSS) to produce per-speaker speech segments. Short utterances containing one word or fewer are filtered, and the remaining speech segments are re-clustered for the final NSD decoding. We present evaluation results progressively explored from the CHiME-8 NOTSOFAR-1 challenge, demonstrating the effectiveness of our modular diarization system and its contribution to improving speech recognition performance. The code will be open-sourced at <span><span>https://github.com/rywang99/USTC-NERCSLIP_CHiME-8</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101863"},"PeriodicalIF":3.4,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144772518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Raw acoustic-articulatory multimodal dysarthric speech recognition","authors":"Zhengjun Yue , Erfan Loweimi , Zoran Cvetkovic , Jon Barker , Heidi Christensen","doi":"10.1016/j.csl.2025.101839","DOIUrl":"10.1016/j.csl.2025.101839","url":null,"abstract":"<div><div>Automatic speech recognition (ASR) for dysarthric speech is challenging. The acoustic characteristics of dysarthric speech are highly variable and there are often fewer distinguishing cues between phonetic tokens. Multimodal ASR utilises the data from other modalities to facilitate the task when a single acoustic modality proves insufficient. Articulatory information, which encapsulates knowledge about the speech production process, may constitute such a complementary modality. Although multimodal acoustic-articulatory ASR has received increasing attention recently, incorporating real articulatory data is under-explored for dysarthric speech recognition. This paper investigates the effectiveness of multimodal acoustic modelling using real dysarthric speech articulatory information in combination with acoustic features, especially raw signal representations which are more informative than classic features, leading to learning representations tailored to dysarthric ASR. In particular, various raw acoustic-articulatory multimodal dysarthric speech recognition systems are developed and compared with similar systems with hand-crafted features. Furthermore, the difference between dysarthric and typical speech in terms of articulatory information is systematically analysed by using a statistical space distribution indicator called Maximum Articulator Motion Range (MAMR). Additionally, we used mutual information analysis to investigate the robustness and phonetic information content of the articulatory features, offering insights that support feature selection and the ASR results. Experimental results on the widely used TORGO dysarthric speech dataset show that combining the articulatory and raw acoustic features at the empirically found optimal fusion level achieves a notable performance gain, leading to up to 7.6% and 12.8% relative word error rate (WER) reduction for dysarthric and typical speech, respectively.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101839"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward fast meeting transcription: NAIST system for CHiME-8 NOTSOFAR-1 task and its analysis","authors":"Yuta Hirano , Mau Nguyen , Kakeru Azuma , Jan Meyer Saragih , Sakriani Sakti","doi":"10.1016/j.csl.2025.101836","DOIUrl":"10.1016/j.csl.2025.101836","url":null,"abstract":"<div><div>This paper reports on the NAIST system submitted to the CHIME-8 challenge’s NOTSOFAR-1 (Natural Office Talkers in Settings of Far-field Audio Recordings) task, including results and analyses from several additional experiments. While fast processing is crucial for real-world applications, the CHIME-7 challenge focused solely on reducing error rate, neglecting the practical aspects of system performance such as inference speed. Therefore, this research aims to develop a practical system by improving recognition accuracy while simultaneously reducing inference speed. To address this challenge, we propose enhancing the baseline module architecture by modifying both the CSS and ASR modules. Specifically, the ASR module was built based on a WavLM large feature extractor and a Zipformer transducer. Furthermore, we employed reverberation removal using block-wise weighted prediction error (WPE) as preprocessing for the speech separation module. The proposed system achieved a relative reduction in tcpWER of 11.6% for single-channel tracks and 18.7% for multi-channel tracks compared to the baseline system. Moreover, the proposed system operates up to six times faster than the baseline system while achieving superior tcpWER results. We also report on the observed changes in system performance due to variations in the amount of training data for the ASR model, as well the impact of the maximum word-length setting in the transducer-based ASR module on the subsequent diarization system, based on findings from our system development.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101836"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144633205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards explainable spoofed speech attribution and detection: A probabilistic approach for characterizing speech synthesizer components","authors":"Jagabandhu Mishra , Manasi Chhibber , Hye-jin Shim , Tomi H. Kinnunen","doi":"10.1016/j.csl.2025.101840","DOIUrl":"10.1016/j.csl.2025.101840","url":null,"abstract":"<div><div>We propose an explainable probabilistic framework for characterizing spoofed speech by decomposing it into probabilistic attribute embeddings. Unlike raw high-dimensional countermeasure embeddings, which lack interpretability, the proposed probabilistic attribute embeddings aim to detect specific speech synthesizer components, represented through high-level attributes and their corresponding values. We use these probabilistic embeddings with four classifier back-ends to address two downstream tasks: spoofing detection and spoofing attack attribution. The former is the well-known bonafide-spoof detection task, whereas the latter seeks to identify the source method (generator) of a spoofed utterance. We additionally use Shapley values, a widely used technique in machine learning, to quantify the relative contribution of each attribute value to the decision-making process in each task. Results on the ASVspoof2019 dataset demonstrate the substantial role of waveform generator, conversion model outputs, and inputs in spoofing detection; and inputs, speaker, and duration modeling in spoofing attack attribution. In the detection task, the probabilistic attribute embeddings achieve 99.7% balanced accuracy and 0.22% equal error rate (EER), closely matching the performance of raw embeddings (99.9% balanced accuracy and 0.22% EER). Similarly, in the attribution task, our embeddings achieve 90.23% balanced accuracy and 2.07% EER, compared to 90.16% and 2.11% with raw embeddings. These results demonstrate that the proposed framework is both inherently explainable by design and capable of achieving performance comparable to raw CM embeddings.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101840"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144298503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Time–Frequency Causal Hidden Markov Model for speech-based Alzheimer’s disease longitudinal detection","authors":"Yilin Pan , Jiabing Li , Yating Zhang , Zhuoran Tian , Yijia Zhang , Mingyu Lu","doi":"10.1016/j.csl.2025.101862","DOIUrl":"10.1016/j.csl.2025.101862","url":null,"abstract":"<div><div>Speech deterioration is an early indicator in individuals with Alzheimer’s disease (AD), with progression influenced by various factors, leading to unique trajectories for each individual. To facilitate automated longitudinal detection of AD using speech, we propose an enhanced Hidden Markov Model (HMM), termed the Time-Frequency Causal HMM (TF-CHMM), which models disease-causative acoustic features over time under the Markov property. The TF-CHMM integrates a parallel convolutional neural network as an encoder for spectrograms, extracting both time-domain and frequency-domain features from audio recordings linked to AD. Additionally, it incorporates personal attributes (e.g., age) and clinical diagnosis data (e.g., MMSE scores) as supplementary inputs, disentangling disease-related features from unrelated components through a sequential variational auto-encoder with causal inference. The TF-CHMM is evaluated using the Pitt Corpus, which includes annual visits for each subject with a variable number of longitudinal samples, comprising audio recordings, manual transcriptions, MMSE scores, and age information. Experimental results demonstrated the effectiveness of our designed system, achieving a competitive accuracy of 90.24% and an F1 score of 90.00%. An ablation study further highlighted the efficiency of the parallel convolutional kernels in extracting time–frequency information and emphasized the effectiveness of our longitudinal experimental setup in the AD detection system.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101862"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144687179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Universal constituency treebanking and parsing: A pilot study","authors":"Jianling Li , Meishan Zhang , Jianrong Wang , Min Zhang , Yue Zhang","doi":"10.1016/j.csl.2025.101826","DOIUrl":"10.1016/j.csl.2025.101826","url":null,"abstract":"<div><div>Universal language processing is crucial for developing models that work across multiple languages. However, universal constituency parsing has lagged due to the lack of annotated universal constituency (UC) treebanks. To address this, we propose two cost-effective approaches. First, we unify existing annotated language-specific treebanks using phrase label mapping to create UC trees, but this is limited to only a handful of languages. Second, we develop a novel method to convert Universal Dependency (UD) treebanks into UC treebanks using large language models (LLMs) with syntactic knowledge, enabling the construction of UC treebanks for over 150 languages. We adopt the graph-based max margin model as our baseline and introduce a language adapter to fine-tune the universal parser. Our experiments show that the language adapter maintains performance for high-resource languages and improves performance for low-resource languages. We evaluate different scales of multilingual pre-trained models, confirming the effectiveness and robustness of our approach. In summary, we conduct the first pilot study on universal constituency parsing, introducing novel methods for creating and utilizing UC treebanks, thereby advancing treebanking and parsing methodologies.<span><span><sup>1</sup></span></span></div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101826"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144240489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An item response theory framework to evaluate automatic speech recognition systems against speech difficulty","authors":"Chaina Santos Oliveira, Ricardo B.C. Prudêncio","doi":"10.1016/j.csl.2025.101817","DOIUrl":"10.1016/j.csl.2025.101817","url":null,"abstract":"<div><div>Evaluating the performance of Automatic Speech Recognition (ASR) systems is very relevant for selecting good techniques and understanding their advantages and limitations. ASR systems are usually evaluated by adopting test sets of audio speeches, ideally with different difficulty levels. In this sense, it is important to analyse whether a system under test correctly transcribes easy test speeches, while being robust to the most difficult ones. In this paper, a novel framework is proposed for evaluating ASR systems, which covers two complementary issues: (1) to measure the difficulty of each test speech; and (2) to analyse each ASR system’s performance against the difficulty level. Regarding the first issue, the framework measures speech difficulty by adopting Item Response Theory (IRT). Regarding the second issue, the Recognizer Characteristic Curve (RCC) is proposed, which is a plot of the ASR system’s performance versus speech difficulty. ASR performance is further analysed by a two-dimensional plot, in which speech difficulty is decomposed by IRT into sentence difficulty and speaker quality. In the experiments, the proposed framework was applied in a test set produced by adopting text-to-speech tools, with diverse speakers and sentences. Additionally, noise injection was applied to produce test items with even higher difficulty levels. In the experiments, noise injection actually increases difficulty and generates a wide variety of speeches to assess ASR performance. However, it is essential to pay attention that high noise levels can lead to an unreliable evaluation. The proposed plots were helpful for both identifying robust ASR systems as well as for choosing the noise level that results in both diversity and reliability.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101817"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144072019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BERSting at the screams: A benchmark for distanced, emotional and shouted speech recognition","authors":"Paige Tuttösí , Mantaj Dhillon , Luna Sang , Shane Eastwood , Poorvi Bhatia , Quang Minh Dinh , Avni Kapoor , Yewon Jin , Angelica Lim","doi":"10.1016/j.csl.2025.101815","DOIUrl":"10.1016/j.csl.2025.101815","url":null,"abstract":"<div><div>Some speech recognition tasks, such as automatic speech recognition (ASR), are approaching or have reached human performance in many reported metrics. Yet, they continue to struggle in complex, real-world, situations, such as with distanced speech. Previous challenges have released datasets to address the issue of distanced ASR, however, the focus remains primarily on distance, specifically relying on multi-microphone array systems. Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 h of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors homes and therefore includes at least 98 different acoustic environments. The data also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were places in 19 different positions, including obstructions and being in a different room than the actor. This data is publicly available for use and can be used to evaluate a variety of speech recognition tasks, including: ASR, shout detection, and speech emotion recognition (SER). We provide initial benchmarks for ASR and SER tasks, and find that ASR degrades both with an increase in distance and shout level and shows varied performance depending on the intended emotion. Our results show that the BERSt dataset is challenging for both ASR and SER tasks and continued work is needed to improve the robustness of such systems for more accurate real-world use.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101815"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144106633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}