{"title":"Early identification of bulbar motor dysfunction in ALS: An approach using AFM signal decomposition","authors":"Shaik Mulla Shabber , Mohan Bansal","doi":"10.1016/j.specom.2025.103246","DOIUrl":"10.1016/j.specom.2025.103246","url":null,"abstract":"<div><div>Amyotrophic lateral sclerosis (ALS) is an aggressive neurodegenerative disorder that impacts the nerve cells in the brain and spinal cord that control muscle movements. Early ALS symptoms include speech and swallowing difficulties, and sadly, the disease is incurable and fatal in some instances. This study aims to construct a predictive model for identifying speech dysarthria and bulbar motor dysfunction in ALS patients, using speech signals as a non-invasive biomarker. Utilizing an amplitude and frequency modulated (AFM) signal decomposition model, the study identifies distinctive characteristics crucial for monitoring and diagnosing ALS. The study focuses on classifying ALS patients and healthy controls (HC) through a machine-learning approach, employing the TORGO database for analysis. Recognizing speech signals as potential biomarkers for ALS detection, the study aims to achieve early identification without invasive measures. An ensemble learning classifier attains a remarkable 97% accuracy in distinguishing between ALS and HC based on features extracted using the AFM signal model.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103246"},"PeriodicalIF":2.4,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143929052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An update rule for multiple source variances estimation using microphone arrays","authors":"Fan Zhang , Chao Pan , Jingdong Chen , Jacob Benesty","doi":"10.1016/j.specom.2025.103245","DOIUrl":"10.1016/j.specom.2025.103245","url":null,"abstract":"<div><div>This paper addresses the problem of time-varying variance estimation in scenarios with multiple speech sources and background noise using a microphone array, which is an important issue in speech enhancement. Under the optimal principle of maximum likelihood (ML), the variance estimation under the general cases occurs no explicit formula, and all the variances require to be updated iteratively. Inspired by the fixed-point iteration (FPI) method, we derive an update rule for variance estimation by introducing a dummy term and exploiting the ML condition. Insights into the update rule is investigated and the relationship with the variance estimates under least-squares (LS) principle is presented. Finally, by simulations, we show that the resulting variance update rule is very efficient and effective, which requires only a few iterations to converge, and the estimation error is very close to the Cramér–Rao Bound (CRB).</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103245"},"PeriodicalIF":2.4,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143895313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep learning based stage-wise two-dimensional speaker localization with large ad-hoc microphone arrays","authors":"Shupei Liu , Linfeng Feng , Yijun Gong , Chengdong Liang , Chen Zhang , Xiao-Lei Zhang , Xuelong Li","doi":"10.1016/j.specom.2025.103247","DOIUrl":"10.1016/j.specom.2025.103247","url":null,"abstract":"<div><div>While deep-learning-based speaker localization has shown advantages in challenging acoustic environments, it often yields only direction-of-arrival (DOA) cues rather than precise two-dimensional (2D) coordinates. To address this, we propose a novel deep-learning-based 2D speaker localization method leveraging ad-hoc microphone arrays. Specifically, each ad-hoc array comprises randomly distributed microphone nodes, each of which is equipped with a traditional array. Our approach first employs convolutional neural networks at each node to estimate speaker directions.Then, we integrate these DOA estimates using triangulation and clustering techniques to get 2D speaker locations. To further boost the estimation accuracy, we introduce a node selection algorithm that strategically filters the most reliable nodes. Extensive experiments on both simulated and real-world data demonstrate that our approach significantly outperforms conventional methods. The proposed node selection further refines performance. The real-world dataset in the experiment, named Libri-adhoc-node10 which is a newly recorded data described for the first time in this paper, is online available at <span><span>https://github.com/Liu-sp/Libri-adhoc-nodes10</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103247"},"PeriodicalIF":2.4,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143892370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech Emotion Recognition via CNN-Transformer and multidimensional attention mechanism","authors":"Xiaoyu Tang , Jiazheng Huang , Yixin Lin , Ting Dang , Jintao Cheng","doi":"10.1016/j.specom.2025.103242","DOIUrl":"10.1016/j.specom.2025.103242","url":null,"abstract":"<div><div>Speech Emotion Recognition (SER) is crucial in human–machine interactions. Previous approaches have predominantly focused on local spatial or channel information and neglected the temporal information in speech. In this paper, to model local and global information at different levels of granularity in speech and capture temporal, spatial and channel dependencies in speech signals, we propose a Speech Emotion Recognition network based on CNN-Transformer and multi-dimensional attention mechanisms. Specifically, a stack of CNN blocks is dedicated to capturing local information in speech from a time–frequency perspective. In addition, a time-channel-space attention mechanism is used to enhance features across three dimensions. Moreover, we model local and global dependencies of feature sequences using large convolutional kernels with depthwise separable convolutions and lightweight Transformer modules. We evaluate the proposed method on IEMOCAP and Emo-DB datasets and show our approach significantly improves the performance over the state-of-the-art methods. <span><span>https://github.com/SCNU-RISLAB/CNN-Transforemr-and-Multidimensional-Attention-Mechanism</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103242"},"PeriodicalIF":2.4,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143883032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vibravox: A dataset of french speech captured with body-conduction audio sensors","authors":"Julien Hauret , Malo Olivier , Thomas Joubaud , Christophe Langrenne , Sarah Poirée , Véronique Zimpfer , Éric Bavu","doi":"10.1016/j.specom.2025.103238","DOIUrl":"10.1016/j.specom.2025.103238","url":null,"abstract":"<div><div>Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR) containing audio recordings using five different body-conduction audio sensors: two in-ear microphones, two bone conduction vibration pickups, and a laryngophone. The dataset also includes audio data from an airborne microphone used as a reference. The Vibravox corpus contains 45 h per sensor of speech samples and physiological sounds recorded by 188 participants under different acoustic conditions imposed by a high order ambisonics 3D spatializer. Annotations about the recording conditions and linguistic transcriptions are also included in the corpus. We conducted a series of experiments on various speech-related tasks, including speech recognition, speech enhancement, and speaker verification. These experiments were carried out using state-of-the-art models to evaluate and compare their performances on signals captured by the different audio sensors offered by the Vibravox dataset, with the aim of gaining a better grasp of their individual characteristics.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103238"},"PeriodicalIF":2.4,"publicationDate":"2025-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143892371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lexical, syntactic, semantic and acoustic entrainment in Slovak, Spanish, English, and Hungarian: A cross-linguistic comparison","authors":"Jay Kejriwal , Štefan Beňuš","doi":"10.1016/j.specom.2025.103240","DOIUrl":"10.1016/j.specom.2025.103240","url":null,"abstract":"<div><div>Entrainment is the tendency of speakers to reuse each other’s linguistic material, including lexical, syntactic, semantic, or acoustic–prosodic, during a conversation. While entrainment has been studied in English and other Germanic languages, it is less researched in other language groups. In this study, we evaluated lexical, syntactic, semantic, and acoustic–prosodic entrainment in four comparable spoken corpora of four typologically different languages (English, Slovak, Spanish, and Hungarian) using comparable tools and methodologies based on DNN embeddings. Our cross-linguistic comparison revealed that Hungarian speakers are closer to their interlocutors and more consistent with their own linguistic features when compared to English, Slovak, and Spanish speakers. Further, comparison across different linguistic levels within each language revealed that speakers are closest to their partners and most consistent with their own linguistic features at the acoustic level, followed by semantic, lexical, and syntactic levels. Examining the four languages separately, we found that people’s tendency to be close to each other at each turn (proximity) varies at different linguistic levels in different languages. Additionally, we found that entrainment in lexical, syntactic, semantic, and acoustic–prosodic features are positively correlated in all four datasets. Our results are relevant for the predictions of Interactive Alignment theory (Pickering and Garrod, 2004) and may facilitate implementing entrainment functionality in human–machine interactions (HMI).</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103240"},"PeriodicalIF":2.4,"publicationDate":"2025-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143876675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expectation of speech style improves audio-visual perception of English vowels","authors":"Joan A. Sereno , Allard Jongman , Yue Wang , Paul Tupper , Dawn M. Behne , Jetic Gu , Haoyao Ruan","doi":"10.1016/j.specom.2025.103243","DOIUrl":"10.1016/j.specom.2025.103243","url":null,"abstract":"<div><div>Speech perception is influenced by both signal-internal properties and signal-independent knowledge, including communicative expectations. This study investigates how these two factors interact, focusing on the role of speech style expectations. Specifically, we examine how prior knowledge about speech style (clear versus plain speech) affects word identification and speech style judgment. Native English perceivers were presented with English words containing tense versus lax vowels in either clear or plain speech, with trial conditions manipulating whether style prompts (presented immediately prior to the target word) were congruent or incongruent with the actual speech style. The stimuli were also presented in three input modalities: auditory (speaker voice), visual (speaker face), and audio-visual. Results show that prior knowledge of speech style improved accuracy in identifying style after the session when style information in the prompt and target word was consistent, particularly in auditory and audio-visual modalities. Additionally, as expected, clear speech enhanced word intelligibility compared to plain speech, with benefits more evident for tense vowels and in auditory and audio-visual contexts. These results demonstrate that congruent style prompts improve style identification accuracy by aligning with high-level expectations, while clear speech enhances word identification accuracy due to signal-internal modifications. Overall, the current findings suggest an interplay of processing sources of information which are both signal-driven and signal-independent, and that high-level signal-complementary information such as speech style is not separate from, but is embodied in, the signal.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103243"},"PeriodicalIF":2.4,"publicationDate":"2025-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143855649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural Chinese silent speech recognition with facial electromyography","authors":"Liang Xie , Yakun Zhang , Hao Yuan , Meishan Zhang , Xingyu Zhang , Changyan Zheng , Ye Yan , Erwei Yin","doi":"10.1016/j.specom.2025.103230","DOIUrl":"10.1016/j.specom.2025.103230","url":null,"abstract":"<div><div>The majority work in speech recognition is based on audible speech and has already achieved great success. However, in several special scenarios, the voice might be unavailable. Recently, Gaddy and Klein (2020) presented an initial study of silent speech analysis, aiming to voice the silent speech from facial electromyography (EMG). In this work, we present the first study of neural silent speech recognition in Chinese, which goes one step further to convert the silent facial EMG signals into text directly. We build a benchmark dataset and then introduce a neural end-to-end model to the task. The model is further optimized with two auxiliary tasks for better feature learning. In addition, we suggest a systematic data augmentation strategy to improve model performance. Experimental results show that our final best model can achieve a character error rate of 38.0% on a sentence-level silent speech recognition task. We also provide in-depth analysis to gain a comprehensive understanding of our task and the various models proposed. Although our model achieves initial results, there is still a gap compared to the ideal level, warranting further attention and research.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103230"},"PeriodicalIF":2.4,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143850540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Disfluency production in children with attention-deficit/hyperactivity disorder during a narrative task","authors":"Annemarie Bijnens , Aurélie Pistono","doi":"10.1016/j.specom.2025.103244","DOIUrl":"10.1016/j.specom.2025.103244","url":null,"abstract":"<div><div>Limited evidence exists on ADHD-related disfluency and lexical diversity behaviour in connected speech, although a significant number of individuals with ADHD experience language difficulties at different linguistic levels. Using a retrospective cross-sectional design with data from the Asymmetries TalkBank database, this study aims to capture differences in disfluency production and lexical diversity between children with ADHD and Typically Developing (TD) children. These measures include the frequencies of different disfluency subtypes and two lexical diversity measures, and are correlated with performance on a working memory task and a response inhibition task. Results indicate that the ADHD group produced a higher mean frequency of each disfluency type, but no differences were found to be significant. Correlation analysis revealed that filled pauses and revisions were negatively correlated with working memory and response inhibition in the ADHD group, whereas they were positively correlated with working memory performance in the TD group. This suggests that the underlying causes of disfluency differ in each group and that further research is required of speech monitoring ability in children with ADHD.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103244"},"PeriodicalIF":2.4,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding perception and production in loan adaptation: Cases of English loans in Mandarin","authors":"Mingchang Lü","doi":"10.1016/j.specom.2025.103207","DOIUrl":"10.1016/j.specom.2025.103207","url":null,"abstract":"<div><div>This study investigates the formation of English loans in Mandarin from the lens of both perception and production. Excluding loans that involve semantic or lexical adaptation, I explore how the two aspects of perception and production may separately account for various adaptation patterns of segmental change in phonological loans—those whose formation is governed solely by phonological processes. Specifically, perceptual interpretation is composed of auditory (acoustic) correlates. Building upon my previous work, I argue that production involves the adapter's awareness of articulatory economy and attempt to facilitate the interlocutor's perception, in addition to their prosodic knowledge of the native phonology, as addressed at length in my earlier proposals. Conclusions are drawn primarily upon language universals, cross-linguistic trends, and coarticulatory factors. The emergent patterns provide compelling evidence that orthographic influence is only marginal.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103207"},"PeriodicalIF":2.4,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143874568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}