{"title":"Identifying robust and dataset-independent acoustic biomarkers of depression through multi-model feature consensus analysis","authors":"Musyysb Yousufi, Rytis Maskeliunas","doi":"10.1016/j.csl.2026.101960","DOIUrl":"10.1016/j.csl.2026.101960","url":null,"abstract":"<div><div>Speech is one of the most abundant and natural sources of acoustic data containing prosodic and spectral information. Acoustic features help diagnose mental and emotional health issues. In recent years, several researchers have looked at speech features as a way to detect depression. However, most of the frameworks only work with the data on which they were trained and do not work with new speakers, recording devices, or languages. This research aims to identify reliable and interpretable acoustic features that serve as stable indicators of depression in various speech datasets.</div><div>This study used two publicly available datasets, E-DAIC and MODMA. A total of 107 handcrafted prosodic, spectral, and voice quality acoustic features were extracted from 4-second segments, with 1-second overlap for long audios and padding for short audio clips. Subject-aware pre-processing was used to prevent speaker level overlap. Five feature selection algorithms were used and their findings were integrated using a consensus-based rank aggregation framework to identify consistent depression related features in both datasets. The resulting set of characteristics was evaluated using four classifier architectures through a K-sweep analysis. The adaptation of the correlation alignment domain was used to reduce distribution mismatches by aligning second-order statistics between the source and target domains, allowing robust cross-dataset transfer evaluation. Bidirectional cross-dataset evaluation demonstrated effective generalization in both transfer directions. Models trained on E-DAIC achieved F1=0.49-0.52 in MODMA (92%–94% of within-dataset performance), while MODMA trained models achieved F1=0.34–0.35 in E-DAIC, exceeding the baseline within-dataset of E-DAIC. The negative domain loss observed in E-DAIC (domain loss = −0.22 to −0.24) reflects high intra-dataset heterogeneity from naturalistic recording conditions rather than poor generalizability. These findings demonstrate that robust acoustic depression biomarkers can be learned from diverse datasets, enabling the detection of cross-linguistic depression.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101960"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design of single-channel speech enhancement algorithm in noisy acoustic environments","authors":"Yi-Fu Zhao, Guang-Hui Dong, Nan Liu","doi":"10.1016/j.csl.2026.101955","DOIUrl":"10.1016/j.csl.2026.101955","url":null,"abstract":"<div><div>In speech enhancement, Transformers and Self-Attention-based denoising networks are widely used and perform well, and speech enhancement serves as a valuable front-end for speech recognition. However, existing dual-branch architectures lack sufficient natural speech phase extraction due to the phase spectrum’s sensitivity and easy compensation, and traditional dilated convolution architectures are unsuitable for resource-constrained devices, creating an urgent need for lightweight alternatives. Thus, this paper proposes TFEM-PHASEN-MINI, a discrete dual-branch phase extraction architecture based on the Base and Detail Feature Modules. It uses DilatedReparamBlock to replace the Dense Encoder’s dilated convolution module, balancing computational efficiency and performance by fusing Convolutional Neural Networks and Transformers. It also designs a time-frequency feature extraction module to verify integrating speech recognition modules into speech enhancement, and adds a Phase Enhancement Module to address insufficient phase-spectrum speech phoneme feature extraction (caused by magnitude spectrum over-compensation) via parallel phase estimation. On the VoiceBank+DEMAND dataset, it achieves scores of 3.44, 4.72, 4.18, 17.13, 2.10, and 0.96 for PESQ, CSIG, COVL, FWSSNR, CEPS, and STOI, respectively. On the DNS-Challenge dataset, it attains scores of 3.20 and 3.57 for WB-PESQ and NB-PESQ, respectively. On the EARS-WHAM testset and its blind testset, it improves the metrics of PESQ, CSIG, CBAK, COVL, SSNR, FWSSNR, CEPS, and STOI by 0.56, 1.00, 0.94, 0.83, 8.42, 5.26, 0.21, and 0.15 respectively, and achieves non-intrusive metrics (Overall Quality of 3.80, Noisiness of 4.18, Discontinuity of 4.32, Coloration of 3.85, Loudness of 3.45), showing optimal generalization. Though it has relatively lower CBAK and SSNR on the VoiceBank+DEMAND dataset, it remains overall advanced. Computational complexity and device inference tests verify the balance between its computational efficiency and accuracy.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101955"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147386295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sentence representations for semantic textual similarity: A systematic review","authors":"Larissa Guder , João Paulo Aires , Hígor Uélinton da Silva , Felipe Meneguzzi , Dalvan Griebler","doi":"10.1016/j.csl.2026.101970","DOIUrl":"10.1016/j.csl.2026.101970","url":null,"abstract":"<div><div>In natural language processing (NLP), generating semantically-rich representations of sentences can improve performance on multiple tasks, such as question answering, duplicate detection, sentiment analysis, and machine translation. Recent approaches to NLP using machine learning can produce text representations that carry syntactic and semantic information. This article surveys recent works on generating sentence representations for semantic textual similarity tasks. We conduct our survey using a systematic literature review approach. We retrieve papers from several digital libraries and summarize their key techniques and findings. We propose a taxonomy to facilitate the understanding of the semantic textual similarity task on the sentence level. In our analysis, we describe the current state-of-the-art in sentence representation for semantic textual similarity and propose a guideline for working on this task.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101970"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-linguistic analysis of prosodic features based on wavelet prominence: A study of L2 English and L1 Sindhi lexical stress using large language & deep learning models","authors":"Abdul Malik Abbasi , Imtiaz Husain","doi":"10.1016/j.csl.2026.101953","DOIUrl":"10.1016/j.csl.2026.101953","url":null,"abstract":"<div><div>This study presents a cross-linguistic analysis of prosodic features in English and Sindhi, with an emphasis on modelling lexical stress and rhythmic prominence using advanced artificial intelligence techniques. The proposed framework integrates wavelet-based signal processing with Deep Learning architectures and prosodic embeddings extracted from Large Language Models (LLMs). We address the lack of computational research on Sindhi lexical stress and investigate the central research question of whether a fused representation of CWT-based prosodic prominence and Wav2Vec 2.0 embeddings can accurately model stress patterns and support cross-lingual transfer to L2 English. Trained on lexical stress patterns in Sindhi, the system is applied to English speech data from speakers with diverse first-language (L1) backgrounds to automatically predict syllable prominence. Experimental results show that the hybrid model combining continuous wavelet transform (CWT) features with BiLSTM and Wav2Vec 2.0 embeddings achieves a stress classification accuracy of 92.1%, outperforming baseline models by a significant margin. Feature ablation analysis confirms duration as the most predictive cue in Sindhi, while pitch dominates in English. The model's prominence estimates show strong alignment with human-assigned CEFR ratings (Pearson’s r = 0.78, <em>p</em> < 0.001), validating its perceptual reliability. These findings underscore the effectiveness of interpretable, AI-driven approaches for multilingual prosody modelling and highlight their practical utility in speech synthesis, automatic speech recognition, and language learning technologies.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101953"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling the temporal envelope of sub-band signals for improving the performance of children’s speech recognition system in zero-resource scenario","authors":"Kaustav Das, Biswaranjan Pattanayak, Gayadhar Pradhan","doi":"10.1016/j.csl.2026.101954","DOIUrl":"10.1016/j.csl.2026.101954","url":null,"abstract":"<div><div>Children’s KWS (keyword spotting) systems often experience a significant decline in performance when acoustic mismatches occur between training and testing conditions. Though multiple factors are liable for creating such mismatches, pitch and speaking rate are the two predominant sources of acoustic mismatch. This work proposes a pitch-robust acoustic feature by computing the temporal envelope of sub-band signals to develop a children’s KWS system in the zero-resource scenario. To accomplish this, the speech signal is first passed through <span><math><mi>M</mi></math></span> non-overlapping band-pass filters arranged in a linear scale to break it down into sub-bands. Then, the temporal envelope of each sub-band signal is estimated with the application of the Hilbert transform. The mean values of the estimated envelopes are computed over an analysis frame and logarithmically compressed to yield an <span><math><mi>M</mi></math></span>-dimensional feature vector per analysis frame, here termed the logarithmically compressed averaged temporal envelope of sub-band signals (LC-ATESS). The efficacy of the proposed LC-ATESS feature is tested on the deep neural network-hidden Markov model-based acoustic model. The observed KWS results are superior to conventional Mel-frequency cepstral coefficients (MFCC), MFCC computed after spectral smoothing, and features calculated from single-frequency spectra, both with and without data augmentation, across clean and noisy test scenarios.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101954"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HRDF-MER: Hierarchical feature refinement and cascaded dynamic fusion for multimodal emotion recognition","authors":"Jianjun Lei , Zhenmei Mu , Ying Wang","doi":"10.1016/j.csl.2026.101978","DOIUrl":"10.1016/j.csl.2026.101978","url":null,"abstract":"<div><div>Multimodal Emotion Recognition (MER) is challenged by modality misalignment, shallow temporal cue modeling, and inefficient fusion. This paper proposes HRDF-MER, a framework that integrates hierarchical refinement and cascaded dynamic fusion for more robust emotion recognition. To improve cross-modal alignment and unimodal representation, HRDF-MER introduces a novel Hierarchical Cross-modal Feature Refinement (HCFR) strategy, which integrates Cross-modal Adaptive Alignment (CAA) and Hierarchical Feature Enhancement (HFE). The CAA module employs multi-head cross-attention to construct hierarchical correlation matrices for precise acoustic-text alignment, and the HFE employs a Transformer with cross-modal residual connections to further enhance unimodal representations for robust feature learning. We further propose a Cascaded Multimodal Dynamic Fusion (CMDF) strategy, where a cross-attention encoder captures fine-grained inter-modal dependencies and a gated fusion unit adaptively weights modalities to progressively produce highly discriminative multimodal representations. Moreover, a multi-objective training scheme is proposed to jointly optimize feature alignment and classification by integrating Cross-modal Label Contrastive Loss (CLC Loss) with cross-entropy loss. Extensive experiments on the IEMOCAP and MELD datasets demonstrate that HRDF-MER significantly outperforms state-of-the-art models, while ablation studies further confirm the effectiveness and necessity of each proposed component.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101978"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improve NNLMs by text generation from pre-trained language models","authors":"Minguang Song, Yunxin Zhao","doi":"10.1016/j.csl.2026.101969","DOIUrl":"10.1016/j.csl.2026.101969","url":null,"abstract":"<div><div>Large pre-trained language models (PLMs) are capable of learning rich linguistic knowledge and have shown their power on automatic speech recognition (ASR). However, the high computing cost of large PLMs restricts their direct applications in certain real world scenarios of low computing resources. In this paper, we propose an effective approach for leveraging PLMs in text-generation based data augmentation to improve task-specific neural network language model (NNLM) for ASR, which is an important problem that has not yet been well addressed. Our data augmentation method first fine-tune a PLM on in-domain data to generate in-domain-like text, and then select novel sentences according to a desired distribution of sentence perplexities. The selected text and the in-domain data form an augmented dataset for training a lightweight NNLM. Since the fine-tuned PLM captures in it both general and in-domain linguistic knowledge, adequately utilizing such generated texts in model training promotes the generalization capability of NNLMs. We have evaluated our proposed approach on the ASR tasks of Wall Street Journal (WSJ) and Augmented Multiparty Interaction (AMI) meeting. Our experimental results have shown large reductions on word error rate and perplexity by the lightweight augmented NNLMs, demonstrating the promising potential of high-performance NNLM deployment for ASR in resource-constrained environments.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101969"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An exhaustive evaluation method for open-domain LLM dialogue by constructing recursive CoT","authors":"Shengjie Zhao , Zhenping Xie","doi":"10.1016/j.csl.2026.101957","DOIUrl":"10.1016/j.csl.2026.101957","url":null,"abstract":"<div><div>In recent years, evaluation methods based on large language models (LLMs) have demonstrated advanced performance in reference-free evaluation of open-domain dialogue quality. However, existing approaches often rely on simple, manually crafted evaluation instructions, lacking the depth and diversity to reflect complex human thinking processes. To address these limitations, we propose the Rec-CoT-Eval framework, a reference-free method for evaluating dialogue quality that automatically constructs a Chain-of-Thought (CoT) through interaction with LLMs. Unlike existing methods that depend on manually crafted instructions, our approach enables the automatic construction of a CoT for evaluation. We treat each evaluation metric as a root task and use prompts to guide the LLMs in recursively decomposing it into sub-problems in a top-down manner. By solving these sub-problems, a comprehensive evaluation CoT is constructed. Ultimately, this CoT is used as a prompt for the LLMs, enabling them to act as dialogue quality evaluation agents and perform reference-free evaluation of target dialogues. Furthermore, the framework incorporates an optional human-computer interaction mechanism, designed to meet the need for fine-grained and personalized customization of evaluation criteria in practical industrial applications. This mechanism allows evaluators to dynamically modify the generated CoT when necessary, integrating expert knowledge to enhance evaluation accuracy and personalization. Experimental results demonstrate that our proposed method achieves a higher correlation with human judgments and outperforms existing approaches.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101957"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147386293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improvements in Spanish audio transcription workflows: Integrating preprocessing, LLM-based correction, and speaker diarization and identification","authors":"Gonzalo Nieto Montero, Santiago Hernández, Juan Casal","doi":"10.1016/j.csl.2026.101966","DOIUrl":"10.1016/j.csl.2026.101966","url":null,"abstract":"<div><div>Robust, richly-annotated transcription of Spanish broadcast audio remains difficult under realistic conditions even for state-of-the-art multilingual ASR systems. This paper advances Spanish speech transcription through a framework that couples (i) targeted audio preprocessing, (ii) large language model (LLM) post-correction with deterministic verification, and (iii) diarization plus speaker identity assignment to produce both more accurate and more informative transcripts. First, we show that applying HDemucs vocal isolation followed by band-limited filtering improves WhisperX (Whisper large-v3) performance on modern RTVE broadcast test sets, reaching 10.82% WER on RTVE2022DB (2.79% relative reduction vs. WhisperX) and 10.36% on RTVE2020DB. To define the boundaries of this approach, we also evaluate the NVIDIA Canary-1B-v2 model, observing that these gains are model-dependent. Second, we introduce a verification algorithm for LLM-based correction that constrains the model to a purely corrective role via normalized-text equivalence checks and bounded edit-distance acceptance, preserving pipeline determinism while retaining LLM benefits. On two formatting-noise stress tests (RTVE2017-week subtitles and noisy VoxPopuli-es), this mechanism nearly halves case- and punctuation-sensitive error rate and identifies a robust operating region for tolerance thresholds. Third, we enrich transcripts with speaker names by combining WhisperX/pyannote diarization with audio-embedding matching and complementary transcript-driven (LLM) identification, achieving 29.92% DER on RTVE2022DB, an improvement over the challenge reference baseline. Together, the modules deliver cleaner, speaker-aware transcripts that surpass the strongest zero-shot WhisperX baseline and illustrate how carefully combining off-the-shelf models can advance Spanish ASR in realistic conditions without training.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101966"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147386298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attention based convolutional residual squeeze excited capsule network for aspect based sentiment classification in Malayalam movie reviews","authors":"Sharika TR , Julia Punithamalar Dhas","doi":"10.1016/j.csl.2026.101952","DOIUrl":"10.1016/j.csl.2026.101952","url":null,"abstract":"<div><div>One of the main functions of Natural Language Processing (NLP) is sentiment analysis, which extracts attitudes, ideas, views or judgments about a given topic. The Internet is a vast and unstructured information source full of text documents, including evaluations and opinions. Firstly, the input texts are pre-processed using an efficient NLP method such as tokenization, stemming, removal of empty sets, stop words removal and morphological segmentation. These pre-processed texts serve as the input for the feature extraction stage. Using the three methods of Improved Term Frequency-Inverse Document Frequency (ITF-IDF), Latent Semantic Analysis (LSA) and Extended Bidirectional Encoder Representations from Transformers (E-BERT), the review-based features are extracted. Aspect-based features are extracted from the review text using the Aspect Related Feature (ARF) extraction method. By enhancing term weights with improved frequency scaling, the model improves on regular TF-IDF and includes more subtle contextual meanings and relationships with words. Finally, applying both types of features, a new Attention-based Convolutional Residual Squeeze Excited Capsule Network (A-CR-SECapNet) model is created to classify sentiment polarities as positive, negative and neutral. The Convolutional Residual Module captures spatial relationships to learn deeper networks that mitigate vanishing gradients. The SE Module improves the attentiveness of the network by dynamically reweighting the channel-wise information from features that correlate with important sentiment variables. The CapNet preserves the spatial relationships between words to maintain the dependence of sentiment between features. Finally, the performance of the model is further improved by fine-tuning the parameters using the Modified Gazelle Optimization (MGO) optimization method. In the results section, the proposed model is compared to the existing model in terms of precision, f1-score, accuracy, recall, mean absolute error (MSE) and mean absolute percentage error (MAPE). The proposed model produced the best results, demonstrating its superiority.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101952"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147386296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}