{"title":"AraFastQA: a transformer model for question-answering for Arabic language using few-shot learning","authors":"Asmaa Alrayzah , Fawaz Alsolami , Mostafa Saleh","doi":"10.1016/j.csl.2025.101857","DOIUrl":"10.1016/j.csl.2025.101857","url":null,"abstract":"<div><div>In recent years, numerous studies have developed pre-trained language models (PLMs) for Arabic natural language processing (NLP) tasks, including question-answering (QA), but often overlooking the challenge of data scarcity. This study introduces the Arabic Few-Shot QA (AraFastQA) pre-trained language model to confront the challenge of limited resources in Arabic QA tasks. The primary contributions of this study involve developing an PLM based on a few-shot learning (FSL) approach to address the challenge of low-resource datasets in Arabic QA. Moreover, this study contributes to the developing of Arabic benchmark few-shot QA datasets. By using the few-shot datasets, we compare the AraFastQA PLM with the state-of-art Arabic PLMs such that AraBERT, AraELECTRA, and XLM-Roberta. We evaluated AraFastQA and state-of-art models on two Arabic benchmark datasets that are Arabic reading comprehension (ARCD) and the typologically diverse question answering (TyDiQA). The obtained experimental results show that AraFastQA outperforms other models across eight training sample sizes of the Arabic benchmark datasets. For instance, our proposed PLM achieves 73.2 of F1-score on TyDi QA with only 1024 training examples while the highest accuracy of other models (AraELECTRA) achieves 56.1. For the full training dataset of ARCD dataset, AraFastQA improves accuracy by 9 %, 3 %, and 10 % of AraBERT, AraELECTRA, and XLM-Roberta respectively.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101857"},"PeriodicalIF":3.1,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144470963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting accentedness and comprehensibility through ASR scores and acoustic features","authors":"Wenwei Dong, Catia Cucchiarini, Roeland van Hout, Helmer Strik","doi":"10.1016/j.csl.2025.101858","DOIUrl":"10.1016/j.csl.2025.101858","url":null,"abstract":"<div><div>Accentedness and comprehensibility scales are widely used in measuring the oral proficiency of second language (L2) learners, including learners of English as a Second Language (ESL). In this paper, we focus on gaining a better understanding of the concepts of accentedness and comprehensibility by developing and applying automatic measures to ESL utterances produced by Indonesian learners. We extracted features both on the segmental and the suprasegmental (fundamental frequency, loudness, energy et al.) levels to investigate which features are actually related to expert judgments on accentedness and comprehensibility. Automatic Speech Recognition (ASR) pronunciation scores based on the traditional Kaldi Time Delay Neural Network (TDNN) model and on the End-to-End Whisper model were applied, and data-driven methods were used by combining acoustic features extracted by the Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) and Praat. The experimental results showed that Whisper outperformed the Kaldi-TDNN model. The Whisper model gave the best results for predicting comprehensibility on the basis of phone distance, and the best results for predicting accentedness on the basis of grapheme distance. Combining segmental and suprasegmental features improved the results, yielding different feature rankings for comprehensibility and accentedness. In our final step of analysis, we included differences between utterances and learners as random effects in a mixed linear regression model. Exploiting these information sources yielded a substantial improvement in predicting both comprehensibility and accentedness.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101858"},"PeriodicalIF":3.1,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144470962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-turn response selection with Language Style and Topic Aware enhancement","authors":"Weiwei Li, Yuzhong Chen, Junjie Xu, Jiayuan Zhong, Chen Dong","doi":"10.1016/j.csl.2025.101842","DOIUrl":"10.1016/j.csl.2025.101842","url":null,"abstract":"<div><div>The multi-turn response selection is an important component in retrieval-based human–computer dialogue systems. Most recent models adopt the utilization of pre-trained language models to acquire fine-grained semantic information within diverse dialogue contexts, thereby enhancing the precision of response selection. However, effectively leveraging the language style information of speakers along with the topic information in the dialogue context to enhance the semantic understanding capability of pre-trained language models still poses a significant challenge that requires resolution. To address this challenge, we propose a BERT-based Language Style and Topic Aware (BERT-LSTA) model for multi-turn response selection. BERT-LSTA augments BERT with two distinctive modules: the Language Style Aware (LSA) module and the Question-oriented Topic Window Selection (QTWS) module. The LSA module introduces a contrastive learning method to learn the latent language style information from distinct speakers in the dialogue. The QTWS module proposes a topic window segmentation algorithm to segment the dialogue context into topic windows, which facilitates the capacity of BERT-LSTA to refine and incorporate relevant topic information for response selection. Experimental results on two public benchmark datasets demonstrate that BERT-LSTA outperforms all state-of-the-art baseline models across various metrics. Furthermore, ablation studies reveal that the LSA module significantly improves performance by capturing speaker-specific language styles, while the QTWS module enhances topic relevance by filtering irrelevant contextual information.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101842"},"PeriodicalIF":3.1,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144298502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minerva 2 for speech and language tasks","authors":"Rhiannon Mogridge, Anton Ragni","doi":"10.1016/j.csl.2025.101843","DOIUrl":"10.1016/j.csl.2025.101843","url":null,"abstract":"<div><div>Most artificial neural networks do not directly incorporate a memory of previous experiences, instead using training data to parameterise a model, and then discarding the training data prior to inference. While some recent models have included a memory, this has typically been added to an already highly parameterised model. An alternative option is to use a purely memory-based model, and then add parameters. This has been shown to work for Minerva 2, a simple, non-parametric, memory-based model which has been widely used in the field of human psychology. We revisit the use of Minerva 2 for speech and language tasks, drawing comparisons between Minerva 2 and other architectures, and showing that an iterative process that Minerva 2 uses for inference is a close relative of deep equilibrium models. We assess parameterised models based on Minerva 2, including a sequence model inspired by Minerva 2’s similarity to the transformer architecture, which shows promising results.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101843"},"PeriodicalIF":3.1,"publicationDate":"2025-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144314149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition","authors":"Alexander Polok , Dominik Klement , Martin Kocour , Jiangyu Han , Federico Landini , Bolaji Yusuf , Matthew Wiesner , Sanjeev Khudanpur , Jan Černocký , Lukáš Burget","doi":"10.1016/j.csl.2025.101841","DOIUrl":"10.1016/j.csl.2025.101841","url":null,"abstract":"<div><div>Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for extensive speaker-specific training data. Our method introduces frame-level diarization-dependent transformations (FDDT) and query-key biasing (QKb) techniques to refine the model’s focus on target speakers while effectively handling overlapping speech. By leveraging diarization outputs as conditioning signals, DiCoW simplifies the workflow for multi-speaker ASR, improves generalization to unseen speakers and enables more reliable transcription in real-world multi-speaker recordings. Additionally, we explore the integration of a connectionist temporal classification (CTC) head to Whisper and demonstrate its ability to improve transcription efficiency through hybrid decoding. Notably, we show that our approach is not limited to Whisper; it also provides similar benefits when applied to the Branchformer model. We validate DiCoW on real-world datasets, including AMI and NOTSOFAR-1 from CHiME-8 challenge, as well as synthetic benchmarks such as Libri2Mix and LibriCSS, enabling direct comparisons with previous methods. Results demonstrate that DiCoW enhances the model’s target-speaker ASR capabilities while maintaining Whisper’s accuracy and robustness on single-speaker data.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101841"},"PeriodicalIF":3.1,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144314148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards explainable spoofed speech attribution and detection: A probabilistic approach for characterizing speech synthesizer components","authors":"Jagabandhu Mishra , Manasi Chhibber , Hye-jin Shim , Tomi H. Kinnunen","doi":"10.1016/j.csl.2025.101840","DOIUrl":"10.1016/j.csl.2025.101840","url":null,"abstract":"<div><div>We propose an explainable probabilistic framework for characterizing spoofed speech by decomposing it into probabilistic attribute embeddings. Unlike raw high-dimensional countermeasure embeddings, which lack interpretability, the proposed probabilistic attribute embeddings aim to detect specific speech synthesizer components, represented through high-level attributes and their corresponding values. We use these probabilistic embeddings with four classifier back-ends to address two downstream tasks: spoofing detection and spoofing attack attribution. The former is the well-known bonafide-spoof detection task, whereas the latter seeks to identify the source method (generator) of a spoofed utterance. We additionally use Shapley values, a widely used technique in machine learning, to quantify the relative contribution of each attribute value to the decision-making process in each task. Results on the ASVspoof2019 dataset demonstrate the substantial role of waveform generator, conversion model outputs, and inputs in spoofing detection; and inputs, speaker, and duration modeling in spoofing attack attribution. In the detection task, the probabilistic attribute embeddings achieve 99.7% balanced accuracy and 0.22% equal error rate (EER), closely matching the performance of raw embeddings (99.9% balanced accuracy and 0.22% EER). Similarly, in the attribution task, our embeddings achieve 90.23% balanced accuracy and 2.07% EER, compared to 90.16% and 2.11% with raw embeddings. These results demonstrate that the proposed framework is both inherently explainable by design and capable of achieving performance comparable to raw CM embeddings.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101840"},"PeriodicalIF":3.1,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144298503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Raw acoustic-articulatory multimodal dysarthric speech recognition","authors":"Zhengjun Yue , Erfan Loweimi , Zoran Cvetkovic , Jon Barker , Heidi Christensen","doi":"10.1016/j.csl.2025.101839","DOIUrl":"10.1016/j.csl.2025.101839","url":null,"abstract":"<div><div>Automatic speech recognition (ASR) for dysarthric speech is challenging. The acoustic characteristics of dysarthric speech are highly variable and there are often fewer distinguishing cues between phonetic tokens. Multimodal ASR utilises the data from other modalities to facilitate the task when a single acoustic modality proves insufficient. Articulatory information, which encapsulates knowledge about the speech production process, may constitute such a complementary modality. Although multimodal acoustic-articulatory ASR has received increasing attention recently, incorporating real articulatory data is under-explored for dysarthric speech recognition. This paper investigates the effectiveness of multimodal acoustic modelling using real dysarthric speech articulatory information in combination with acoustic features, especially raw signal representations which are more informative than classic features, leading to learning representations tailored to dysarthric ASR. In particular, various raw acoustic-articulatory multimodal dysarthric speech recognition systems are developed and compared with similar systems with hand-crafted features. Furthermore, the difference between dysarthric and typical speech in terms of articulatory information is systematically analysed by using a statistical space distribution indicator called Maximum Articulator Motion Range (MAMR). Additionally, we used mutual information analysis to investigate the robustness and phonetic information content of the articulatory features, offering insights that support feature selection and the ASR results. Experimental results on the widely used TORGO dysarthric speech dataset show that combining the articulatory and raw acoustic features at the empirically found optimal fusion level achieves a notable performance gain, leading to up to 7.6% and 12.8% relative word error rate (WER) reduction for dysarthric and typical speech, respectively.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101839"},"PeriodicalIF":3.1,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sentiment analysis for live video comments with variational residual representations","authors":"Changfan Luo , Ling Fang , Bensheng Qiu","doi":"10.1016/j.csl.2025.101838","DOIUrl":"10.1016/j.csl.2025.101838","url":null,"abstract":"<div><div>Live video comment (LVC) is valuable for public opinion analysis, communication, and user engagement. Analyzing the sentiment in LVC is crucial for understanding their content, especially when strong emotions are involved. However, compared to normal text, LVC exhibits a stronger real-time nature, as well as context-dependent and cross-modal misalignment. Conventional sentiment analysis methods rely solely on textual information and explicit context, yet current multi-modal sentiment analysis models are insufficient to discriminate context and align multi-modal information. To address these challenges, we propose a novel variational residual fusion network based on a variational autoencoder for sentiment analysis of LVCs. Especially, an autofilter module is introduced in the encoder to filter out useful surrounding comments as contextual information for the target comment. A residual fusion module is embedded between the encoder and decoder to discriminate the most relevant visual information, facilitating the alignment of multi-modal information and thereby enhancing the learning of target comment representation. Furthermore, our method follows a multi-task learning scheme to help the model reinforce the representation of the target comments and improve the effectiveness of sentiment analysis. Extensive experiments suggest the effectiveness of the proposed framework in this work. <span><span><sup>1</sup></span></span></div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101838"},"PeriodicalIF":3.1,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144263942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring knowledge distillation for low-resource multi-modal streaming ASR in the CHiME-8 MMCSG challenge","authors":"Hongbo Lan, Ya Jiang, Jun Du, Qing Wang","doi":"10.1016/j.csl.2025.101837","DOIUrl":"10.1016/j.csl.2025.101837","url":null,"abstract":"<div><div>In the CHiME-8 Multi-modal Conversational Speech Recognition for Smart Glasses (MMCSG) challenge, participants were tasked with achieving real-time transcription of two-person conversations recorded with smart glasses. To address the scarcity of real-world data, we propose a knowledge distillation framework where a non-streaming teacher model, trained on augmented multi-channel audio, guides a streaming student model. Leveraging simulated data with varying overlap rates, the framework employs a logit-based Kullback–Leibler divergence loss alongside mean square error losses on hidden states and attention maps of Fast-Conformer layers to transfer knowledge from the teacher to the student, significantly improving the performance of the audio-only streaming automatic speech recognition (ASR) model. Furthermore, we exploit the synergy and complementarity of inertial measurement unit and audio data by developing a novel multi-modal streaming ASR model. Meanwhile, cross-modal distillation is performed by adopting the non-streaming audio-only teacher to guide the streaming multi-modal student. Experimental results demonstrate that our proposed multi-modal fusion and teacher-student learning framework effectively enhance the performance of streaming ASR models. Notably, our approach secured the first place in the sub-track of the CHiME-8 MMCSG challenge.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101837"},"PeriodicalIF":3.1,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144240491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Universal constituency treebanking and parsing: A pilot study","authors":"Jianling Li , Meishan Zhang , Jianrong Wang , Min Zhang , Yue Zhang","doi":"10.1016/j.csl.2025.101826","DOIUrl":"10.1016/j.csl.2025.101826","url":null,"abstract":"<div><div>Universal language processing is crucial for developing models that work across multiple languages. However, universal constituency parsing has lagged due to the lack of annotated universal constituency (UC) treebanks. To address this, we propose two cost-effective approaches. First, we unify existing annotated language-specific treebanks using phrase label mapping to create UC trees, but this is limited to only a handful of languages. Second, we develop a novel method to convert Universal Dependency (UD) treebanks into UC treebanks using large language models (LLMs) with syntactic knowledge, enabling the construction of UC treebanks for over 150 languages. We adopt the graph-based max margin model as our baseline and introduce a language adapter to fine-tune the universal parser. Our experiments show that the language adapter maintains performance for high-resource languages and improves performance for low-resource languages. We evaluate different scales of multilingual pre-trained models, confirming the effectiveness and robustness of our approach. In summary, we conduct the first pilot study on universal constituency parsing, introducing novel methods for creating and utilizing UC treebanks, thereby advancing treebanking and parsing methodologies.<span><span><sup>1</sup></span></span></div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101826"},"PeriodicalIF":3.1,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144240489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}