{"title":"Multimodal laryngoscopic video analysis for assisted diagnosis of vocal fold paralysis","authors":"Yucong Zhang , Xin Zou , Jinshan Yang , Wenjun Chen , Juan Liu , Faya Liang , Ming Li","doi":"10.1016/j.csl.2025.101891","DOIUrl":"10.1016/j.csl.2025.101891","url":null,"abstract":"<div><div>This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS),<span><span><sup>2</sup></span></span> a novel system that leverages both audio and video data to automatically extract key video segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method to analyze both video and audio data, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Beyond key video segment extraction from the raw laryngeal videos, MLVAS is able to generate effective audio and visual features for Vocal Fold Paralysis (VFP) detection. Pre-trained audio encoders are utilized to encode the patient voice to get the audio features. Visual features are generated by measuring the angle deviation of both the left and right vocal folds to the estimated glottal midline on the segmented glottis masks. To get better masks, we introduce a diffusion-based refinement that follows traditional U-Net segmentation to reduce false positives. We conducted several ablation studies to demonstrate the effectiveness of each module and modalities in the proposed MLVAS. The experimental results on a public segmentation dataset show the effectiveness of our proposed segmentation module. In addition, unilateral VFP classification results on a real-world clinic dataset demonstrate MLVAS’s ability of providing reliable and objective metrics as well as visualization for assisted clinical diagnosis.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101891"},"PeriodicalIF":3.4,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145266429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A speech prediction model based on codec modeling and transformer decoding","authors":"Heming Wang , Yufeng Yang , DeLiang Wang","doi":"10.1016/j.csl.2025.101892","DOIUrl":"10.1016/j.csl.2025.101892","url":null,"abstract":"<div><div>Speech prediction is essential for tasks like packet loss concealment and algorithmic delay compensation. This paper proposes a novel prediction algorithm that leverages a speech codec and transformer decoder to autoregressively predict missing frames. Unlike text-guided methods requiring auxiliary information, the proposed approach operates solely on speech for prediction. A comparative study is conducted to evaluate and compare the proposed and existing speech prediction methods on packet loss concealment (PLC) and frame-wise speech prediction tasks. Comprehensive experiments demonstrate that the proposed model achieves superior prediction results, which are substantially better than other state-of-the-art baselines, including on a recent PLC challenge. We also systematically examine factors influencing prediction performance, including context window lengths, prediction lengths, and training and inference strategies.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101892"},"PeriodicalIF":3.4,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145157458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AIPO: Automatic Instruction Prompt Optimization by model itself with “Gradient Ascent”","authors":"Kyeonghye Park, Daeshik Kim","doi":"10.1016/j.csl.2025.101889","DOIUrl":"10.1016/j.csl.2025.101889","url":null,"abstract":"<div><div>Large language models (LLMs) can perform a variety of tasks such as summarization, translation, and question answering by generating answers with user input prompt. The text that is used as input to the model, including instruction, is called input prompt. There are two types of input prompt: zero-shot prompting provides a question with no examples, on the other hand, few-shot prompting provides a question with multiple examples. The way the input prompt is set can have a big impact on the accuracy of the model generation. The relevant research is called prompt engineering. Prompt engineering, especially prompt optimization is used to find the optimal prompts optimized for each model and task. Manually written prompts could be optimal prompts, but it is time-consuming and expensive. Therefore, research is being conducted on automatically generating prompts that are as effective as human-crafted ones for each task. We propose <em>Automatic Instruction Prompt Optimization</em> (AIPO), which allows the model to generate an initial prompt directly through instruction induction when given a task in a zero-shot setting and then improve the initial prompt to optimal prompt for model based on the “gradient ascent” algorithm. With the final prompt generated by AIPO, we achieve more accurate generation than manual prompt on benchmark datasets regardless of the output format.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101889"},"PeriodicalIF":3.4,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145157457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Continual End-to-End Speech-to-Text translation using augmented bi-sampler","authors":"Balaram Sarkar, Pranav Karande, Ankit Malviya, Chandresh Kumar Maurya","doi":"10.1016/j.csl.2025.101885","DOIUrl":"10.1016/j.csl.2025.101885","url":null,"abstract":"<div><div>Speech-to-Text (ST) is the translation of speech in one language to text in another language. Earlier models for ST used a pipeline approach combining automatic speech recognition (ASR) and machine translation (MT). Such models suffer from cascade error propagation, high latency and memory consumption. Therefore, End-to-End (E2E) ST models were proposed. Adapting E2E ST models to new language pairs results in deterioration of performance on the previously trained language pairs. This phenomenon is called Catastrophic Forgetting (CF). Therefore, we need ST models that can learn continually. The present work proposes a novel continual learning (CL) framework for E2E ST tasks. The core idea behind our approach combines proportional-language sampling (PLS), random sampling (RS), and augmentation. RS helps in performing well on the current task by sampling aggressively from it. PLS is used to sample equal proportion from past task data but it may cause over-fitting. To mitigate that, a combined approach of PLS+RS is used, dubbed as continual bi-sampler (CBS). However, CBS still suffers from over-fitting due to repeated samples from the past tasks. Therefore, we apply various augmentation strategies combined with CBS which we call continual augmented bi-sampler (CABS). We perform experiments on 4 language pairs of MuST-C (One to Many) and mTEDx (Many to Many) datasets and achieve a gain of <strong>68.38%</strong> and <strong>41%</strong> respectively in the average BLEU score compared to baselines. CABS also mitigates the average forgetting by <strong>82.2%</strong> in MuST-C dataset compared to the Gradient Episodic Memory (GEM) baseline. The results show that the proposed CL based E2E ST ensures knowledge retention across previously trained languages. To the best of our knowledge, E2E ST model has not been studied before in a CL setup.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101885"},"PeriodicalIF":3.4,"publicationDate":"2025-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145157456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment","authors":"Yusuke Yasuda, Tomoki Toda","doi":"10.1016/j.csl.2025.101888","DOIUrl":"10.1016/j.csl.2025.101888","url":null,"abstract":"<div><div>Preference-based subjective evaluation is a key method for reliably evaluating generative media. However, its huge number of pair combinations makes it prohibitively difficult to apply to large-scale evaluation using crowdsourcing. To address this issue, we propose an automatic optimization method for preference-based subjective evaluation in terms of pair combination selections and the allocation of evaluation volumes with online learning in a crowdsourcing environment. We use a preference-based online learning method based on a sorting algorithm to identify the total order of systems with minimum sample volumes. Our online learning algorithm supports parallel and asynchronous executions under fixed-budget conditions required for crowdsourcing. Our experiment on the preference-based subjective evaluation of synthetic speech on naturalness shows that our method successfully optimizes the preference-based test by reducing the number of pair combinations from 351 to 83 and allocating optimal evaluation volumes for each pair ranging from 30 to 663 without compromising evaluation errors and wasting budget allocations.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101888"},"PeriodicalIF":3.4,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145105453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Public perceptions of speech technology trust in the United Kingdom","authors":"Jennifer Williams , Tayyaba Azim , Anna-Maria Piskopani , Richard Hyde , Shuo Zhang , Zack Hodari","doi":"10.1016/j.csl.2025.101884","DOIUrl":"10.1016/j.csl.2025.101884","url":null,"abstract":"<div><div>Speech technology is now pervasive throughout the world, impacting a variety of socio-technical use-cases. Speech technology is a broad term encompassing capabilities that translate, analyse, transcribe, generate, modify, enhance, or summarise human speech. Many of the technical features and the possibility of speech data misuse are not often revealed to the users of such systems. When combined with the rapid development of AI and the plethora of use-cases where speech-based AI systems are now being applied, the consequence is that researchers, regulators, designers and government policymakers still have little understanding of the public’s perception of speech technology. Our research explores the public’s perceptions of trust in speech technology by asking people about their experiences, awareness of their rights, their susceptibility to being harmed, their expected behaviour, and ethical choices governing behavioural responsibility. We adopt a multidisciplinary lens to our work, in order to present a fuller picture of the United Kingdom (UK) public perspective through a series of socio-technical scenarios in a large-scale survey. We analysed survey responses from 1,000 participants from the UK, where people from different walks of life were asked to reflect on existing, emerging, and hypothetical speech technologies. Our socio-technical scenarios are designed to provoke and stimulate debate and discussion on principles of trust, privacy, responsibility, fairness, and transparency. We found that gender is a statistically significant factor correlated to awareness of rights and trust. We also found that awareness of rights is statistically correlated to perceptions of trust and responsible use of speech technology. By understanding the notions of responsibility in behaviour and differing perspectives of trust, our work encapsulates the current state of public acceptance of speech technology in the UK. Such an understanding has the potential to affect how regulatory and policy frameworks are developed, how the UK invests in its AI research and development ecosystem, and how speech technology that is developed within the UK might be received by global stakeholders.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101884"},"PeriodicalIF":3.4,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145105455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advanced noise-aware speech enhancement algorithm via adaptive dictionary selection based on compressed sensing in the time-frequency domain","authors":"Naser Sharafi , Salman Karimi , Samira Mavaddati","doi":"10.1016/j.csl.2025.101887","DOIUrl":"10.1016/j.csl.2025.101887","url":null,"abstract":"<div><div>Speech signal enhancement and noise reduction play a vital role in applications such as telecommunications, audio broadcasting, and military systems. This paper proposes a novel speech enhancement method based on compressive sensing principles in the time-frequency domain, incorporating sparse representation and dictionary learning techniques. The proposed method constructs an optimal dictionary of atoms that can sparsely represent clean speech signals. A key component of the framework is a noise-aware block, which leverages multiple pre-trained noise dictionaries along with the spectral features of noisy speech to build a composite noise model. It isolates noise-only segments, computes their sparse coefficients, and evaluates energy contributions across all candidate dictionaries. The dictionary with the highest energy is then selected as the dominant noise type. The algorithm dynamically adapts to handle unseen noise types by selecting the most similar noise structure present in the dictionary pool, offering a degree of generalization. The proposed system is evaluated under three clearly defined scenarios: (i) using a baseline sparse representation model, (ii) incorporating dictionary learning with a fixed noise model, and (iii) employing the full adaptive noise-aware framework. The method demonstrates strong performance against nine types of noise (non-stationary, periodic, and static) across a wide SNR range (-5 dB to +20 dB). On average, it yields 16.71 % improvement in PESQ and 3.39 % in STOI compared to existing techniques. Simulation results confirm the superiority of the proposed approach in both noise suppression and speech intelligibility, highlighting its potential as a robust tool for speech enhancement in real-world noisy environments.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101887"},"PeriodicalIF":3.4,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145105456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Electroglottography-based speech content classification using stacked BiLSTM-FCN network for clinical applications","authors":"Srinidhi Kanagachalam, Deok-Hwan Kim","doi":"10.1016/j.csl.2025.101886","DOIUrl":"10.1016/j.csl.2025.101886","url":null,"abstract":"<div><div>In this study, we introduce a newer approach to classify the human speech contents based on Electroglottographic (EGG) signals. In general, identifying human speech using EGG signals is challenging and unaddressed, as human speech may contain pathology due to vocal cord damage. In this paper, we propose a deep learning-based approach called Stacked BiLSTM-FCN to identify the speech contents for both the healthy and pathological person. This deep learning-based technique integrates a recurrent neural network (RNN) that utilizes bidirectional long short-term memory (BiLSTM) with a convolutional network that uses a squeeze and excitation layer, learns features from the EGG signals and classifies them based on the learned features. Experiments on the existing Saarbruecken Voice Database (SVD) dataset containing healthy and pathological voices with different pitch levels showed an accuracy of 92.09% on the proposed model. Further evaluations prove the generalization performance and robustness of the proposed method for application in clinical laboratories to identify speech contents with different pathologies and varying accent types.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101886"},"PeriodicalIF":3.4,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145048985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep feature representations and fusion strategies for speech emotion recognition from acoustic and linguistic modalities: A systematic review","authors":"Andrea Chaves-Villota , Ana Jimenez-Martín , Mario Jojoa-Acosta , Alfonso Bahillo , Juan Jesús García-Domínguez","doi":"10.1016/j.csl.2025.101873","DOIUrl":"10.1016/j.csl.2025.101873","url":null,"abstract":"<div><div>Emotion Recognition (ER) has gained significant attention due to its importance in advanced human-machine interaction and its widespread real-world applications. In recent years, research on ER systems has focused on multiple key aspects, including the development of high-quality emotional databases, the selection of robust feature representations, and the implementation of advanced classifiers leveraging AI-based techniques. Despite this progress in research, ER still faces significant challenges and gaps that must be addressed to develop accurate and reliable systems. To systematically assess these critical aspects, particularly those centered on AI-based techniques, we employed the PRISMA methodology. Thus, we include journal and conference papers that provide essential insights into key parameters required for dataset development, involving emotion modeling (categorical or dimensional), the type of speech data (natural, acted, or elicited), the most common modalities integrated with acoustic and linguistic data from speech and the technologies used. Similarly, following this methodology, we identified the key representative features that serve as critical emotional information sources in both modalities. For acoustic, this included those extracted from the time and frequency domains, while for linguistic, earlier embeddings and the most common transformer models were considered. In addition, Deep Learning (DL) and attention-based methods were analyzed for both. Given the importance of effectively combining these diverse features for improving ER, we then explore fusion techniques based on the level of abstraction. Specifically, we focus on traditional approaches, including feature-, decision-, DL-, and attention-based fusion methods. Next, we provide a comparative analysis to assess the performance of the approaches included in our study. Our findings indicate that for the most commonly used datasets in the literature: IEMOCAP and MELD, the integration of acoustic and linguistic features reached a weighted accuracy (WA) of 85.71% and 63.80%, respectively. Finally, we discuss the main challenges and propose future guidelines that could enhance the performance of ER systems using acoustic and linguistic features from speech.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101873"},"PeriodicalIF":3.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145004000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparative study on noise-augmented training and its effect on adversarial robustness in ASR systems","authors":"Karla Pizzi , Matías Pizarro , Asja Fischer","doi":"10.1016/j.csl.2025.101869","DOIUrl":"10.1016/j.csl.2025.101869","url":null,"abstract":"<div><div>In this study, we investigate whether noise-augmented training can concurrently improve adversarial robustness in automatic speech recognition (ASR) systems. We conduct a comparative analysis of the adversarial robustness of four different ASR architectures, each trained under three different augmentation conditions: (1) background noise, speed variations, and reverberations; (2) speed variations only; (3) no data augmentation. We then evaluate the robustness of all resulting models against attacks with white-box or black-box adversarial examples. Our results demonstrate that noise augmentation not only enhances model performance on noisy speech but also improves the model’s robustness to adversarial attacks.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101869"},"PeriodicalIF":3.4,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145010591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}