{"title":"A novel approach to cross-linguistic transfer learning for hope speech detection in Tamil and Malayalam","authors":"Jothi Prakash V., Arul Antran Vijay S.","doi":"10.1016/j.csl.2025.101870","DOIUrl":"10.1016/j.csl.2025.101870","url":null,"abstract":"<div><div>In the field of Natural Language Processing (NLP), accurately identifying hope speech in low-resource languages such as Tamil and Malayalam poses significant challenges. This research introduces the Sentimix Transformer (SentT), a novel transformer-based model designed for detecting hope speech in YouTube comments composed in Tamil and Malayalam, two linguistically rich but computationally low-resource languages. The SentT model innovatively combines multilingual BERT (mBERT) embeddings with specialized cultural and code-mixing adaptations to effectively process the linguistic diversity and complexities inherent in code-mixed data. This approach allows SentT to capture nuanced expressions of hope by integrating domain-specific knowledge into the transformer framework. Our methodology extends traditional transformer architectures by incorporating a unique ensemble of embeddings that encapsulate linguistic, cultural, and code-mixing attributes, significantly enhancing the model’s sensitivity to context and cultural idioms. We validate our approach using the Hope Speech dataset for Equality, Diversity, and Inclusion (HopeEDI), which includes diverse comments from social media. The SentT model achieves an impressive accuracy of 93.4%, a precision of 92.7%, and a recall of 94.1% outperforming existing models and demonstrating its efficacy in handling the subtleties of hope speech in multilingual contexts. The model’s architecture and the results of extensive evaluations not only underscore its effectiveness but also its potential as a scalable solution for similar tasks in other low-resource languages. Through this research, we contribute to the broader field of sentiment analysis by demonstrating the potential of tailored, context-aware models in enhancing digital communication’s positivity and inclusiveness.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101870"},"PeriodicalIF":3.4,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144863282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-time audio enhancement framework for vocal performances based on LSTM and time-frequency masking algorithm","authors":"Zan Huang","doi":"10.1016/j.csl.2025.101871","DOIUrl":"10.1016/j.csl.2025.101871","url":null,"abstract":"<div><div>This study proposes a new framework for real-time enhancement of vocal performances based on a long short-term memory (LSTM) network and a time-frequency masking algorithm. The framework primarily addresses the contradiction between non-stationary noise suppression and audio fidelity in complex acoustic scenes. The key innovations of this study are: 1. A real-time enhancement model combining LSTM and ideal ratio masking. The study uses an LSTM to model long-term dependencies in time-frequency features, combining it with an IRM algorithm that dynamically adjusts noise weights. This fusion significantly improves the clarity and intelligibility of audio signals in complex backgrounds. Experiments show that, within a signal-to-noise ratio range of -10 to 5 dB, the model's PESQ and STOI indicators improve to 3.75 and 0.893, respectively. 2. Adaptive Time-Frequency Masking Algorithm The study proposes an adaptive masking mechanism based on the dynamic weight of the signal-to-noise ratio, solving the trade-off between independent binary masking and IRM, as well as between distortion and noise suppression. 3. Masking coefficient optimization driven by a deep neural network. The study presents a bidirectional long short-term memory (LSTM) time-frequency processing module (TFPM) that hierarchically models intra-frame and inter-frame features. At the same time, a composite LSTM ratio masking (LSTM-RM) objective function is introduced to enhance both the amplitude and phase spectra simultaneously. Through end-to-end training, the proposed framework solves the real-time problem and demonstrates stable enhancement effects on ten types of noise test sets. The study provides a scalable algorithmic paradigm for real-time audio enhancement.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101871"},"PeriodicalIF":3.4,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144842508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic speech-based alcohol intoxication detection for automotive safety applications","authors":"Brian Stasak , Julien Epps","doi":"10.1016/j.csl.2025.101872","DOIUrl":"10.1016/j.csl.2025.101872","url":null,"abstract":"<div><div>There is a responsibility to advance automatic alcohol intoxication screening capabilities in modern automobiles to reduce the high rate of alcohol-related accidents and fatalities worldwide. Automatic speech-based alcohol intoxication screening offers a tremendous safety opportunity in the automotive industry due to its non-invasive convenience, comparatively inexpensive cost, and rapid result processing. Using the Alcohol Language Corpus (ALC), this study examines automatic alcohol intoxication classification based on participants’ non-intoxicated/intoxicated omni-microphone speech recordings. Experimentation of many different speech features (e.g., glottal, landmarks, linguistic, prosodic, spectral, syllabic, vocal tract coordination) across different blood alcohol concentration (BAC) ranges and specific verbal tasks show significant changes as participants' BAC increases. Intoxicated participants produce lower average fundamental frequency (F0) with an increase in F0 frequency modulation, breathiness and creakiness voice qualities in intoxicated recordings when compared to their non-intoxicated recordings. For the picture description and tongue twister tasks, manual irregularity disfluency and pause linguistic features significantly increase in intoxicated recordings. Further, for all verbal tasks, automatically extracted syllabic pause features show a significant increase in intoxicated recordings. Implementation of task-dependent support vector machine classifier model with a ≥0.001 BAC 'intoxication' sensitivity threshold increases alcohol classification by up to 8% absolute gain over a task-agnostic approach. Moreover, intoxication classification results demonstrate that task-dependent modeling with majority vote decision improves classification accuracy with up to 20% absolute gain depending on task when compared to file-by-file task-agnostic method results reported previously in ALC baseline studies that used higher quality headset microphone recordings.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101872"},"PeriodicalIF":3.4,"publicationDate":"2025-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144829660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SCoT2S: Self-correcting Text-to-SQL parsing by leveraging LLMs","authors":"Chunlin Zhu , Yuming Lin , Yaojun Cai , You Li","doi":"10.1016/j.csl.2025.101865","DOIUrl":"10.1016/j.csl.2025.101865","url":null,"abstract":"<div><div>Text-to-SQL parsing, which converts natural language questions into executable SQL queries, has emerged as a critical technology for enabling non-technical users to interact with databases effectively. Although recent advances in this field have shown promise, existing models still struggle with complex semantic understanding and accurate SQL generation, particularly in handling schema relationships and join operations. To address these challenges, we propose SCoT2S (Self-Correcting Text-to-SQL), a novel framework that leverages large language models to automatically identify and rectify errors in SQL query generation. Through systematic error analysis of existing Text-to-SQL models, we identify that schema linking and join operations account for more than 70% of parsing errors. Our SCoT2S framework addresses these issues through a three-stage approach: initial SQL generation, comprehensive error detection, and targeted correction using large language models. This approach enables real-time error identification and correction during the parsing process. Extensive experiments demonstrate the effectiveness of the proposed SCoT2S in the Spider benchmark data set. Specifically, SCoT2S shows significant improvements, with a 2.8% increase in EM scores and a 4.0% increase in EX scores compared to current state-of-the-art methods.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101865"},"PeriodicalIF":3.4,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144750492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Three-stage modular speaker diarization collaborating with front-end techniques in the CHiME-8 NOTSOFAR-1 challenge","authors":"Ruo-Yu Wang , Jun Du , Shu-Tong Niu , Gao-Bin Yang , Tian Gao , Jia Pan , Qing-Feng Liu","doi":"10.1016/j.csl.2025.101863","DOIUrl":"10.1016/j.csl.2025.101863","url":null,"abstract":"<div><div>We propose a modular speaker diarization framework that collaborates with front-end techniques in a three-stage process, designed for the challenging CHiME-8 NOTSOFAR-1 acoustic environment. The framework leverages the strengths of deep learning based speech separation systems and traditional speech signal processing techniques to provide more accurate initializations for the Neural Speaker Diarization (NSD) system at each stage, thereby enhancing the performance of a single-channel NSD system. Firstly, speaker overlap detection and Continuous Speech Separation (CSS) are applied to the multichannel speech to obtain clearer single-speaker speech segments for the Clustering-based Speaker Diarization (CSD), followed by the first NSD decoding. Next, the binary speaker masks from the first decoding are used to initialize a complex Angular Center Gaussian Mixture Model (cACGMM) to estimate speaker masks on the multi-channel speech. Using Mask-to-VAD post-processing techniques, we achieve per-speaker speech activity with reduced speaker error (SpkErr), followed by a second NSD decoding. Finally, the second decoding results are used to Guide Source Separation (GSS) to produce per-speaker speech segments. Short utterances containing one word or fewer are filtered, and the remaining speech segments are re-clustered for the final NSD decoding. We present evaluation results progressively explored from the CHiME-8 NOTSOFAR-1 challenge, demonstrating the effectiveness of our modular diarization system and its contribution to improving speech recognition performance. The code will be open-sourced at <span><span>https://github.com/rywang99/USTC-NERCSLIP_CHiME-8</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101863"},"PeriodicalIF":3.4,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144772518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Time–Frequency Causal Hidden Markov Model for speech-based Alzheimer’s disease longitudinal detection","authors":"Yilin Pan , Jiabing Li , Yating Zhang , Zhuoran Tian , Yijia Zhang , Mingyu Lu","doi":"10.1016/j.csl.2025.101862","DOIUrl":"10.1016/j.csl.2025.101862","url":null,"abstract":"<div><div>Speech deterioration is an early indicator in individuals with Alzheimer’s disease (AD), with progression influenced by various factors, leading to unique trajectories for each individual. To facilitate automated longitudinal detection of AD using speech, we propose an enhanced Hidden Markov Model (HMM), termed the Time-Frequency Causal HMM (TF-CHMM), which models disease-causative acoustic features over time under the Markov property. The TF-CHMM integrates a parallel convolutional neural network as an encoder for spectrograms, extracting both time-domain and frequency-domain features from audio recordings linked to AD. Additionally, it incorporates personal attributes (e.g., age) and clinical diagnosis data (e.g., MMSE scores) as supplementary inputs, disentangling disease-related features from unrelated components through a sequential variational auto-encoder with causal inference. The TF-CHMM is evaluated using the Pitt Corpus, which includes annual visits for each subject with a variable number of longitudinal samples, comprising audio recordings, manual transcriptions, MMSE scores, and age information. Experimental results demonstrated the effectiveness of our designed system, achieving a competitive accuracy of 90.24% and an F1 score of 90.00%. An ablation study further highlighted the efficiency of the parallel convolutional kernels in extracting time–frequency information and emphasized the effectiveness of our longitudinal experimental setup in the AD detection system.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101862"},"PeriodicalIF":3.1,"publicationDate":"2025-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144687179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Linguistically informed automatic speech recognition in Sanskrit","authors":"Rishabh Kumar , Devaraja Adiga , Rishav Ranjan , Amrith Krishna , Ganesh Ramakrishnan , Pawan Goyal , Preethi Jyothi","doi":"10.1016/j.csl.2025.101861","DOIUrl":"10.1016/j.csl.2025.101861","url":null,"abstract":"<div><div>The field of Automatic Speech Recognition (ASR) for Sanskrit is marked by distinctive challenges, primarily due to the language’s intricate linguistic and morphological characteristics. Recognizing the burgeoning interest in this domain, we present the ‘Vāksañcayah’ speech corpus, a comprehensive collection that captures the linguistic depth and complexities of Sanskrit. Building upon our prior work, which focused on various acoustic model (AM) and language model (LM) units, we present an enhanced ASR system. This system integrates innovative subword tokenization methods and enriches the search space with linguistic insights. Addressing the issue of high out-of-vocabulary (OOV) rates and the prevalence of infrequently used words in Sanskrit, we employed a subword-based language model. Our approach mitigates these challenges and facilitates the generation of a subword-based search space. While effective in numerous scenarios, this model encounters limitations regarding long-range dependencies and semantic context comprehension. To counter these limitations, we leveraged Sanskrit’s rich morphological framework, thus achieving a more holistic understanding. The subword-based search space is subsequently transformed into a word-based format and augmented with morphological and lexical data, derived from a lexically driven shallow parser. Enhancing this further, we rescore transitions within this enriched space using a supervised morphological parser specifically designed for Sanskrit. Our proposed methodology is currently acclaimed as the most advanced in the realm of Sanskrit ASR, achieving a Word Error Rate (WER) of 12.54 and an improvement of 3.77 absolute points over the previous best. Additionally, we annotated 500 utterances with detailed morphological data and their corresponding lemmas, providing a basis for extensive linguistic analysis.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101861"},"PeriodicalIF":3.1,"publicationDate":"2025-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward fast meeting transcription: NAIST system for CHiME-8 NOTSOFAR-1 task and its analysis","authors":"Yuta Hirano , Mau Nguyen , Kakeru Azuma , Jan Meyer Saragih , Sakriani Sakti","doi":"10.1016/j.csl.2025.101836","DOIUrl":"10.1016/j.csl.2025.101836","url":null,"abstract":"<div><div>This paper reports on the NAIST system submitted to the CHIME-8 challenge’s NOTSOFAR-1 (Natural Office Talkers in Settings of Far-field Audio Recordings) task, including results and analyses from several additional experiments. While fast processing is crucial for real-world applications, the CHIME-7 challenge focused solely on reducing error rate, neglecting the practical aspects of system performance such as inference speed. Therefore, this research aims to develop a practical system by improving recognition accuracy while simultaneously reducing inference speed. To address this challenge, we propose enhancing the baseline module architecture by modifying both the CSS and ASR modules. Specifically, the ASR module was built based on a WavLM large feature extractor and a Zipformer transducer. Furthermore, we employed reverberation removal using block-wise weighted prediction error (WPE) as preprocessing for the speech separation module. The proposed system achieved a relative reduction in tcpWER of 11.6% for single-channel tracks and 18.7% for multi-channel tracks compared to the baseline system. Moreover, the proposed system operates up to six times faster than the baseline system while achieving superior tcpWER results. We also report on the observed changes in system performance due to variations in the amount of training data for the ASR model, as well the impact of the maximum word-length setting in the transducer-based ASR module on the subsequent diarization system, based on findings from our system development.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101836"},"PeriodicalIF":3.1,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144633205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gnowsis: Multimodal multitask learning for oral proficiency assessments","authors":"Hiroaki Takatsu , Shungo Suzuki , Masaki Eguchi , Ryuki Matsuura , Mao Saeki , Yoichi Matsuyama","doi":"10.1016/j.csl.2025.101860","DOIUrl":"10.1016/j.csl.2025.101860","url":null,"abstract":"<div><div>Although oral proficiency assessments are crucial to understand second language (L2) learners’ progress, they are resource-intensive. Herein we propose a multimodal multitask learning model to assess L2 proficiency levels from multiple aspects on the basis of multimodal dialogue data. To construct the model, we first created a dataset of speech samples collected through oral proficiency interviews between Japanese learners of English and a conversational virtual agent. Expert human raters subsequently categorized the samples into the six levels based on the rating scales defined in the Common European Framework of Reference for Languages with respect to proficiency in one holistic and five analytic assessment criteria (vocabulary richness, grammatical accuracy, fluency, goodness of pronunciation, and coherence). The model was trained using this dataset via the multitask learning approach to simultaneously predict the proficiency levels of these language competences from various linguistic features. These features were extracted via multiple encoder modules, which were composed of feature extractors pretrained through various natural language processing tasks such as grammatical error correction, coreference resolution, discourse marker prediction, and pronunciation scoring. In experiments comparing the proposed model to baseline models with a feature extractor pretrained with single modality (textual or acoustic) features, the proposed model outperformed the baseline models. In particular, the proposed model was robust even with limited training data or short dialogues with a smaller number of topics because it considered rich features.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101860"},"PeriodicalIF":3.1,"publicationDate":"2025-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144588195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Causal analysis of ASR errors for children: Quantifying the impact of physiological, cognitive, and extrinsic factors","authors":"Vishwanath Pratap Singh , Md. Sahidullah , Tomi H. Kinnunen","doi":"10.1016/j.csl.2025.101859","DOIUrl":"10.1016/j.csl.2025.101859","url":null,"abstract":"<div><div>The increasing use of children’s automatic speech recognition (ASR) systems has spurred research efforts to improve the accuracy of models designed for children’s speech in recent years. The current approach utilizes either open-source speech foundation models (SFMs) directly or fine-tuning them with children’s speech data. These SFMs, whether open-source or fine-tuned for children, often exhibit higher word error rates (WERs) compared to adult speech. However, there is a lack of systemic analysis of the cause of this degraded performance of SFMs. Understanding and addressing the reasons behind this performance disparity is crucial for improving the accuracy of SFMs for children’s speech. Our study addresses this gap by investigating the causes of accuracy degradation and the primary contributors to WER in children’s speech. In the first part of the study, we conduct a comprehensive benchmarking study on two self-supervised SFMs (<span>Wav2Vec2.0</span> and <span>Hubert</span>) and two weakly supervised SFMs (<span>Whisper</span> and <span>Massively Multilingual Speech (MMS)</span>) across various age groups on two children speech corpora, establishing the raw data for the causal inference analysis in the second part. In the second part of the study, we analyze the impact of physiological factors (age, gender), cognitive factors (pronunciation ability), and external factors (vocabulary difficulty, background noise, and word count) on SFM accuracy in children’s speech using causal inference. The results indicate that physiology (age) and particular external factor (number of words in audio) have the highest impact on accuracy, followed by background noise and pronunciation ability. Fine-tuning SFMs on children’s speech reduces sensitivity to physiological and cognitive factors, while sensitivity to the number of words in audio persists.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101859"},"PeriodicalIF":3.1,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144588194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}