Speech Communication: Latest Publications

Progressive channel fusion for more efficient TDNN on speaker verification
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-23 DOI: 10.1016/j.specom.2024.103105
Zhenduo Zhao, Zhuo Li, Wenchao Wang, Ji Xu
{"title":"Progressive channel fusion for more efficient TDNN on speaker verification","authors":"Zhenduo Zhao ,&nbsp;Zhuo Li ,&nbsp;Wenchao Wang ,&nbsp;Ji Xu","doi":"10.1016/j.specom.2024.103105","DOIUrl":"10.1016/j.specom.2024.103105","url":null,"abstract":"<div><p>ECAPA-TDNN is one of the most popular TDNNs for speaker verification. While most of the updates pay attention to building precisely designed auxiliary modules, the depth-first principle has shown promising performance recently. However, empirical experiments show that one-dimensional convolution (Conv1D) based TDNNs suffer from performance degradation by simply adding massive vanilla basic blocks. Note that Conv1D naturally has a global receptive field (RF) on the feature dimension, progressive channel fusion (PCF) is proposed to alleviate this issue by introducing group convolution to build local RF and fusing the subbands progressively. Instead of reducing the group number in convolution layers used in the previous work, a novel channel permutation strategy is introduced to build information flow between groups so that all basic blocks in the model keep consistent parameter efficiency. The information leakage from lower-frequency bands to higher ones caused by Res2Block is simultaneously solved by introducing group-in-group convolution and using channel permutation. Besides the PCF strategy, redundant connections are removed for a more concise model architecture. The experiments on VoxCeleb and CnCeleb achieve state-of-the-art (SOTA) performance with an average relative improvement of 12.3% on EER and 13.2% on minDCF (0.01), validating the effectiveness of the proposed model.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"163 ","pages":"Article 103105"},"PeriodicalIF":2.4,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141960884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
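The channel-permutation idea described in the abstract can be illustrated with a short PyTorch sketch: a grouped Conv1D gives each group a local receptive field over a subset of channels (subbands), and a permutation then interleaves channels so information flows between groups in the next block. This is only a minimal sketch of the general technique under assumed tensor shapes and invented module names, not the authors' ECAPA-TDNN/PCF implementation.

```python
import torch
import torch.nn as nn

def channel_permute(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so the next grouped Conv1D sees all subbands.

    x: (batch, channels, time); channels must be divisible by groups.
    """
    b, c, t = x.shape
    # (B, groups, c_per_group, T) -> swap group/channel axes -> flatten back to (B, C, T)
    return x.view(b, groups, c // groups, t).transpose(1, 2).reshape(b, c, t)

class GroupedConvBlock(nn.Module):
    """Grouped Conv1D (local receptive field per subband group) followed by permutation."""
    def __init__(self, channels: int, groups: int = 4, kernel_size: int = 3):
        super().__init__()
        self.groups = groups
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=groups)
        self.act = nn.ReLU()

    def forward(self, x):
        return channel_permute(self.act(self.conv(x)), self.groups)

if __name__ == "__main__":
    feats = torch.randn(2, 64, 200)      # (batch, channels, frames), dummy features
    block = GroupedConvBlock(64, groups=4)
    print(block(feats).shape)            # torch.Size([2, 64, 200])
```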
Decoupled structure for improved adaptability of end-to-end models
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-23 DOI: 10.1016/j.specom.2024.103109
Keqi Deng, Philip C. Woodland
{"title":"Decoupled structure for improved adaptability of end-to-end models","authors":"Keqi Deng,&nbsp;Philip C. Woodland","doi":"10.1016/j.specom.2024.103109","DOIUrl":"10.1016/j.specom.2024.103109","url":null,"abstract":"<div><p>Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications. The E2E ASR model implicitly learns an internal language model (LM) which characterises the training distribution of the source domain, and the E2E trainable nature makes the internal LM difficult to adapt to the target domain with text-only data. To solve this problem, this paper proposes decoupled structures for attention-based encoder–decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models, which can achieve flexible domain adaptation in both offline and online scenarios while maintaining robust intra-domain performance. To this end, the acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component (i.e. internal LM) replaceable. When encountering a domain shift, the internal LM can be directly replaced during inference by a target-domain LM, without re-training or using domain-specific paired speech-text data. Experiments for E2E ASR models trained on the LibriSpeech-100h corpus showed that the proposed decoupled structure gave 15.1% and 17.2% relative word error rate reductions on the TED-LIUM 2 and AESRC2020 corpora while still maintaining performance on intra-domain data. It is also shown that the decoupled structure can be used to boost cross-domain speech translation quality while retaining the intra-domain performance.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"163 ","pages":"Article 103109"},"PeriodicalIF":2.4,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000803/pdfft?md5=7e35ebdc40ecd26754dcc103e392268c&pid=1-s2.0-S0167639324000803-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
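To make the decoupling idea concrete, here is a toy PyTorch sketch in which the decoder's linguistic branch is an ordinary sub-module that can be swapped for a target-domain LM at inference time. It is a schematic illustration with invented module names and a deliberately simplified architecture, not the Decoupled-AED or Decoupled-Transducer models from the paper.

```python
import torch
import torch.nn as nn

class DecoupledDecoder(nn.Module):
    """Toy decoder whose linguistic part (internal LM) is a swappable sub-module.

    Hypothetical structure for illustration: an acoustic branch conditioned on an
    encoder summary and a linguistic branch over the token history are combined.
    Replacing `self.internal_lm` mimics swapping in a target-domain LM without retraining.
    """
    def __init__(self, vocab: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.acoustic = nn.GRU(dim, dim, batch_first=True)     # stand-in acoustic branch
        self.internal_lm = nn.GRU(dim, dim, batch_first=True)  # replaceable linguistic branch
        self.out = nn.Linear(2 * dim, vocab)

    def forward(self, prev_tokens, enc_summary):
        e = self.embed(prev_tokens)                         # (B, U, D)
        a, _ = self.acoustic(e + enc_summary.unsqueeze(1))  # acoustic-conditioned states
        l, _ = self.internal_lm(e)                          # text-only linguistic states
        return self.out(torch.cat([a, l], dim=-1))          # (B, U, vocab)

    def swap_internal_lm(self, target_lm: nn.Module):
        """Domain adaptation with text-only data: drop in a target-domain LM."""
        self.internal_lm = target_lm

dec = DecoupledDecoder(vocab=1000)
dec.swap_internal_lm(nn.GRU(256, 256, batch_first=True))  # placeholder for a target-domain LM
logits = dec(torch.randint(0, 1000, (2, 7)), torch.randn(2, 256))
print(logits.shape)                                        # torch.Size([2, 7, 1000])
```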
Speechformer-CTC: Sequential modeling of depression detection with speech temporal classification
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-18 DOI: 10.1016/j.specom.2024.103106
Jinhan Wang, Vijay Ravi, Jonathan Flint, Abeer Alwan
{"title":"Speechformer-CTC: Sequential modeling of depression detection with speech temporal classification","authors":"Jinhan Wang ,&nbsp;Vijay Ravi ,&nbsp;Jonathan Flint ,&nbsp;Abeer Alwan","doi":"10.1016/j.specom.2024.103106","DOIUrl":"10.1016/j.specom.2024.103106","url":null,"abstract":"<div><p>Speech-based automatic depression detection systems have been extensively explored over the past few years. Typically, each speaker is assigned a single label (Depressive or Non-depressive), and most approaches formulate depression detection as a speech classification task without explicitly considering the non-uniformly distributed depression pattern within segments, leading to low generalizability and robustness across different scenarios. However, depression corpora do not provide fine-grained labels (at the phoneme or word level) which makes the dynamic depression pattern in speech segments harder to track using conventional frameworks. To address this, we propose a novel framework, Speechformer-CTC, to model non-uniformly distributed depression characteristics within segments using a Connectionist Temporal Classification (CTC) objective function without the necessity of input–output alignment. Two novel CTC-label generation policies, namely the Expectation-One-Hot and the HuBERT policies, are proposed and incorporated in objectives on various granularities. Additionally, experiments using Automatic Speech Recognition (ASR) features are conducted to demonstrate the compatibility of the proposed method with content-based features. Our results show that the performance of depression detection, in terms of Macro F1-score, is improved on both DAIC-WOZ (English) and CONVERGE (Mandarin) datasets. On the DAIC-WOZ dataset, the system with HuBERT ASR features and a CTC objective optimized using HuBERT policy for label generation achieves 83.15% F1-score, which is close to state-of-the-art without the need for phoneme-level transcription or data augmentation. On the CONVERGE dataset, using Whisper features with the HuBERT policy improves the F1-score by 9.82% on CONVERGE1 (in-domain test set) and 18.47% on CONVERGE2 (out-of-domain test set). These findings show that depression detection can benefit from modeling non-uniformly distributed depression patterns and the proposed framework can be potentially used to determine significant depressive regions in speech utterances.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"163 ","pages":"Article 103106"},"PeriodicalIF":2.4,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000785/pdfft?md5=afe02da612b1e415b45579997ae4074e&pid=1-s2.0-S0167639324000785-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141842447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
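The core training mechanism, frame-level posteriors trained against a short segment-level target with a CTC objective, can be sketched with PyTorch's built-in CTC loss. The label-generation step below is a crude stand-in (repeating the segment label a fixed number of times); the paper's Expectation-One-Hot and HuBERT policies are more elaborate, and all sizes and labels here are invented for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical setup: two emitted classes (1 = non-depressive, 2 = depressive) plus CTC blank = 0.
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

batch, frames, n_classes = 4, 120, 3
log_probs = torch.randn(frames, batch, n_classes).log_softmax(-1)   # (T, B, C) frame posteriors

# Crude label-generation policy: repeat the segment-level label so CTC can distribute it
# over the frames (only a stand-in for the Expectation-One-Hot / HuBERT policies).
segment_labels = torch.tensor([1, 2, 2, 1])                 # one label per speech segment
repeats = 5
targets = segment_labels.unsqueeze(1).repeat(1, repeats)    # (B, repeats)
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), repeats, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(float(loss))
```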
Whisper-SV: Adapting Whisper for low-data-resource speaker verification
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-14 DOI: 10.1016/j.specom.2024.103103
Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie
{"title":"Whisper-SV: Adapting Whisper for low-data-resource speaker verification","authors":"Li Zhang ,&nbsp;Ning Jiang ,&nbsp;Qing Wang ,&nbsp;Yue Li ,&nbsp;Quan Lu ,&nbsp;Lei Xie","doi":"10.1016/j.specom.2024.103103","DOIUrl":"10.1016/j.specom.2024.103103","url":null,"abstract":"<div><p>Trained on 680,000 h of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k distinct layers of Whisper, we design a multi-layer aggregation module in Whisper-SV to integrate multi-layer representations into a singular, compacted representation for SV. In the multi-layer aggregation module, we employ convolutional layers with shortcut connections among different layers to refine speaker characteristics derived from multi-layer representations from Whisper. In addition, an attention aggregation layer is used to reduce non-speaker interference and amplify speaker-specific cues for SV tasks. Finally, a simple classification module is used for speaker classification. Experiments on VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"163 ","pages":"Article 103103"},"PeriodicalIF":2.4,"publicationDate":"2024-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141701112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
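A rough sketch of the layer-selection and aggregation idea, using the Hugging Face Whisper implementation, is shown below: encoder hidden states from several layers are combined with fixed weights and pooled over time into an utterance-level embedding. The checkpoint name, the chosen layers, and the uniform weights are assumptions made for illustration; in the paper the representation-selection and multi-layer aggregation modules are learned.

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

wav = torch.randn(16000 * 3)                               # 3 s of dummy 16 kHz audio
feats = fe(wav.numpy(), sampling_rate=16000, return_tensors="pt").input_features

with torch.no_grad():
    enc = model.encoder(feats, output_hidden_states=True)

hidden = torch.stack(enc.hidden_states)                    # (layers+1, 1, frames, dim)
top_k = [-1, -2, -3]                                       # illustrative "top-k" layers: the last three
weights = torch.softmax(torch.ones(len(top_k)), dim=0)     # placeholder for learned aggregation weights
fused = sum(w * hidden[i] for w, i in zip(weights, top_k)) # weighted multi-layer fusion
embedding = fused.mean(dim=1).squeeze(0)                   # temporal pooling -> utterance embedding
print(embedding.shape)
```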
Advancing speaker embedding learning: Wespeaker toolkit for research and production
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-01 DOI: 10.1016/j.specom.2024.103104
Shuai Wang, Zhengyang Chen, Bing Han, Hongji Wang, Chengdong Liang, Binbin Zhang, Xu Xiang, Wen Ding, Johan Rohdin, Anna Silnova, Yanmin Qian, Haizhou Li
{"title":"Advancing speaker embedding learning: Wespeaker toolkit for research and production","authors":"Shuai Wang ,&nbsp;Zhengyang Chen ,&nbsp;Bing Han ,&nbsp;Hongji Wang ,&nbsp;Chengdong Liang ,&nbsp;Binbin Zhang ,&nbsp;Xu Xiang ,&nbsp;Wen Ding ,&nbsp;Johan Rohdin ,&nbsp;Anna Silnova ,&nbsp;Yanmin Qian ,&nbsp;Haizhou Li","doi":"10.1016/j.specom.2024.103104","DOIUrl":"10.1016/j.specom.2024.103104","url":null,"abstract":"<div><p>Speaker modeling plays a crucial role in various tasks, and fixed-dimensional vector representations, known as speaker embeddings, are the predominant modeling approach. These embeddings are typically evaluated within the framework of speaker verification, yet their utility extends to a broad scope of related tasks including speaker diarization, speech synthesis, voice conversion, and target speaker extraction. This paper presents Wespeaker, a user-friendly toolkit designed for both research and production purposes, dedicated to the learning of speaker embeddings. Wespeaker offers scalable data management, state-of-the-art speaker embedding models, and self-supervised learning training schemes with the potential to leverage large-scale unlabeled real-world data. The toolkit incorporates structured recipes that have been successfully adopted in winning systems across various speaker verification challenges, ensuring highly competitive results. For production-oriented development, Wespeaker integrates CPU- and GPU-compatible deployment and runtime codes, supporting mainstream platforms such as Windows, Linux, Mac and on-device chips such as horizon X3’PI. Wespeaker also provides off-the-shelf high-quality speaker embeddings by providing various pretrained models, which can be effortlessly applied to different tasks that require speaker modeling. The toolkit is publicly available at <span><span>https://github.com/wenet-e2e/wespeaker</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"162 ","pages":"Article 103104"},"PeriodicalIF":2.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141688867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
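As an illustration of the off-the-shelf usage the abstract mentions, the snippet below assumes the pip-installable wespeaker Python package exposes a load_model / extract_embedding / compute_similarity interface as described in the project README; consult https://github.com/wenet-e2e/wespeaker for the current API and available pretrained models. File names are placeholders.

```python
# Hypothetical usage sketch based on the README of the Wespeaker repository;
# verify the exact function names against the installed package version.
import wespeaker

model = wespeaker.load_model("english")                      # fetches a pretrained embedding model
emb = model.extract_embedding("spk1_utt1.wav")               # fixed-dimensional speaker embedding
score = model.compute_similarity("spk1_utt1.wav", "spk1_utt2.wav")
print(score)                                                 # threshold the score to accept/reject a trial
```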
The effects of informational and energetic/modulation masking on the efficiency and ease of speech communication across the lifespan
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-01 DOI: 10.1016/j.specom.2024.103101
Outi Tuomainen, Stuart Rosen, Linda Taschenberger, Valerie Hazan
{"title":"The effects of informational and energetic/modulation masking on the efficiency and ease of speech communication across the lifespan","authors":"Outi Tuomainen ,&nbsp;Stuart Rosen ,&nbsp;Linda Taschenberger ,&nbsp;Valerie Hazan","doi":"10.1016/j.specom.2024.103101","DOIUrl":"10.1016/j.specom.2024.103101","url":null,"abstract":"<div><p>Children and older adults have greater difficulty understanding speech when there are other voices in the background (informational masking, IM) than when the interference is a steady-state noise with a similar spectral profile but is not speech (due to modulation and energetic masking; EM/MM). We evaluated whether this IM vs. EM/MM difference for certain age ranges was found for broader measures of communication efficiency and ease in 114 participants aged between 8 and 80. Participants carried out interactive <em>diapix</em> problem-solving tasks in age-band- and sex-matched pairs, in quiet and with different maskers in the background affecting both participants. Three measures were taken: (a) task transaction time (communication efficiency), (b) performance on a secondary auditory task simultaneously carried out during <em>diapix</em>, and (c) post-test subjective ratings of effort, concentration, difficulty and noisiness (communication ease). Although participants did not take longer to complete the task when in challenging conditions, effects of IM vs. EM/MM were clearly seen on the other measures. Relative to the EM/MM and quiet conditions, participants in IM conditions were less able to attend to the secondary task and reported greater effects of the masker type on their perceived degree of effort, concentration, difficulty and noisiness. However, we found no evidence of decreased communication efficiency and ease in IM relative to EM/MM for children and older adults in any of our measures. The clearest effects of age were observed in transaction time and secondary task measures. Overall, communication efficiency gradually improved between the ages 8–18 years and performance on the secondary task improved over younger ages (until 30 years) and gradually decreased after 50 years of age. Finally, we also found an impact of communicative role on performance. In adults, the participant asked to take the lead in the task and who spoke the most, performed worse on the secondary task than the person who was mainly in a ‘listening’ role and responding to queries. These results suggest that when a broader evaluation of speech communication is carried out that more closely resembles typical communicative situations, the more acute effects of IM typically seen in populations at the extremes of the lifespan are minimised potentially due to the presence of multiple information sources, which allow the use of varying communication strategies. 
Such a finding is relevant for clinical evaluations of speech communication.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"162 ","pages":"Article 103101"},"PeriodicalIF":2.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000736/pdfft?md5=3bae57a7e48911c3d00f77555ed9d386&pid=1-s2.0-S0167639324000736-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141577736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Pathological voice classification using MEEL features and SVM-TabNet model
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-01 DOI: 10.1016/j.specom.2024.103100
Mohammed Zakariah, Muna Al-Razgan, Taha Alfakih
{"title":"Pathological voice classification using MEEL features and SVM-TabNet model","authors":"Mohammed Zakariah ,&nbsp;Muna Al-Razgan ,&nbsp;Taha Alfakih","doi":"10.1016/j.specom.2024.103100","DOIUrl":"10.1016/j.specom.2024.103100","url":null,"abstract":"<div><p>In clinical settings, early diagnosis and objective assessment depend on the detection of voice pathology. To classify anomalous voices, this work uses an approach that combines the SVM-TabNet fusion model with MEEL (Mel-Frequency Energy Line) features. Further, the dataset consists of 1037 speech files, including recordings from people with laryngocele and Vox senilis as well as from healthy persons. Additionally, the main goal is to create an efficient classification model that can differentiate between normal and abnormal voice patterns. Modern techniques frequently lack the accuracy required for a precise diagnosis, which highlights the need for novel strategies. The suggested approach uses an SVM-TabNet fusion model for classification after feature extraction using MEEL characteristics. MEEL features provide extensive information for categorization by capturing complex patterns in audio transmissions. Moreover, by combining the advantages of SVM and TabNet models, classification performance is improved. Moreover, testing the model on test data yields remarkable results: 99.7 % accuracy, 0.992 F1 score, 0.996 precision, and 0.995 recall. Additional testing on additional datasets reliably validates outstanding performance, with 99.4 % accuracy, 0.99 F1 score, 0.998 precision, and 0.989 % recall. Furthermore, using the Saarbruecken Voice Database (SVD), the suggested methodology achieves an impressive accuracy of 99.97 %, demonstrating its durability and generalizability across many datasets. Overall, this work shows how the SVM-TabNet fusion model with MEEL characteristics may be used to accurately and consistently classify diseased voices, providing encouraging opportunities for clinical diagnosis and therapy tracking.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"162 ","pages":"Article 103100"},"PeriodicalIF":2.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141571774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
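A simplified sketch of this kind of pipeline is given below: utterance-level mel-band log-energy statistics (a rough stand-in for the paper's MEEL features) feed a plain SVM classifier. The file names and labels are hypothetical and the TabNet fusion stage is omitted, so this is an outline of the general approach rather than the authors' SVM-TabNet model.

```python
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def mel_energy_features(path: str, n_mels: int = 40) -> np.ndarray:
    """Utterance-level mel-band log-energy statistics (approximation of MEEL-style features)."""
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                       # (n_mels, frames)
    return np.concatenate([logmel.mean(axis=1), logmel.std(axis=1)])

# Hypothetical file lists; labels: 0 = healthy, 1 = pathological.
train_files, train_labels = ["healthy_01.wav", "pathological_01.wav"], [0, 1]
X = np.stack([mel_energy_features(f) for f in train_files])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, train_labels)
print(clf.predict(np.stack([mel_energy_features("test_utt.wav")])))
```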
Analyzing the influence of different speech data corpora and speech features on speech emotion recognition: A review
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-01 DOI: 10.1016/j.specom.2024.103102
Tarun Rathi, Manoj Tripathy
{"title":"Analyzing the influence of different speech data corpora and speech features on speech emotion recognition: A review","authors":"Tarun Rathi,&nbsp;Manoj Tripathy","doi":"10.1016/j.specom.2024.103102","DOIUrl":"10.1016/j.specom.2024.103102","url":null,"abstract":"<div><p>Emotion recognition from speech has become crucial in human-computer interaction and affective computing applications. This review paper examines the complex relationship between two critical factors: the selection of speech data corpora and the extraction of speech features regarding speech emotion classification accuracy. Through an extensive analysis of literature from 2014 to 2023, publicly available speech datasets are explored and categorized based on their diversity, scale, linguistic attributes, and emotional classifications. The importance of various speech features, from basic spectral features to sophisticated prosodic cues, and their influence on emotion recognition accuracy is analyzed.. In the context of speech data corpora, this review paper unveils trends and insights from comparative studies exploring the repercussions of dataset choice on recognition efficacy. Various datasets such as IEMOCAP, EMODB, and MSP-IMPROV are scrutinized in terms of their influence on classifying the accuracy of the speech emotion recognition (SER) system. At the same time, potential challenges associated with dataset limitations are also examined. Notable features like Mel-frequency cepstral coefficients, pitch, intensity, and prosodic patterns are evaluated for their contributions to emotion recognition. Advanced feature extraction methods, too, are explored for their potential to capture intricate emotional dynamics. Moreover, this review paper offers insights into the methodological aspects of emotion recognition, shedding light on the diverse machine learning and deep learning approaches employed. Through a holistic synthesis of research findings, this review paper observes connections between the choice of speech data corpus, selection of speech features, and resulting emotion recognition accuracy. As the field continues to evolve, avenues for future research are proposed, ranging from enhanced feature extraction techniques to the development of standardized benchmark datasets. In essence, this review serves as a compass guiding researchers and practitioners through the intricate landscape of speech emotion recognition, offering a nuanced understanding of the factors shaping its recognition accuracy of speech emotion.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"162 ","pages":"Article 103102"},"PeriodicalIF":2.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141637049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
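For readers who want to experiment with the basic descriptors the review discusses (MFCCs, pitch, intensity), a small librosa-based extraction sketch is shown below; the file name is a placeholder, and mean/std pooling is just one simple way to obtain an utterance-level vector.

```python
import numpy as np
import librosa

# Extract basic acoustic descriptors for one utterance (hypothetical file name).
y, sr = librosa.load("utterance.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # (13, frames) spectral envelope features
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)               # frame-wise pitch estimate (Hz)
rms = librosa.feature.rms(y=y)[0]                           # frame-wise intensity proxy

# Simple utterance-level vector: mean/std pooling of each feature stream.
feat = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                       [np.nanmean(f0), np.nanstd(f0)],
                       [rms.mean(), rms.std()]])
print(feat.shape)                                           # (30,)
```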
Emotions recognition in audio signals using an extension of the latent block model
IF 3.2 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-06-01 DOI: 10.1016/j.specom.2024.103092
Abir El Haj
{"title":"Emotions recognition in audio signals using an extension of the latent block model","authors":"Abir El Haj","doi":"10.1016/j.specom.2024.103092","DOIUrl":"10.1016/j.specom.2024.103092","url":null,"abstract":"<div><p>Emotion detection in human speech is a significant area of research, crucial for various applications such as affective computing and human–computer interaction. Despite advancements, accurately categorizing emotional states in speech remains challenging due to its subjective nature and the complexity of human emotions. To address this, we propose leveraging Mel frequency cepstral coefficients (MFCCS) and extend the latent block model (LBM) probabilistic clustering technique with a Gaussian multi-way latent block model (GMWLBM). Our objective is to categorize speech emotions into coherent groups based on the emotional states conveyed by speakers. We employ MFCCS from time-series audio data and utilize a variational Expectation Maximization method to estimate GMWLBM parameters. Additionally, we introduce an integrated Classification Likelihood (ICL) model selection criterion to determine the optimal number of clusters, enhancing robustness. Numerical experiments on real data from the Berlin Database of Emotional Speech (EMO-DB) demonstrate our method’s efficacy in accurately detecting and classifying emotional states in human speech, even in challenging real-world scenarios, thereby contributing significantly to affective computing and human–computer interaction applications.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"161 ","pages":"Article 103092"},"PeriodicalIF":3.2,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141278454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
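The overall recipe, MFCC-based utterance features clustered into emotion groups with the number of clusters chosen by a penalized-likelihood criterion, can be approximated with a much simpler stand-in: a Gaussian mixture model selected by BIC instead of the paper's Gaussian multi-way latent block model with the ICL criterion. File names are hypothetical.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def utterance_mfcc(path: str) -> np.ndarray:
    """Mean-pooled MFCC vector for one utterance."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

files = ["a1.wav", "a2.wav", "a3.wav", "a4.wav", "a5.wav", "a6.wav"]  # placeholder utterances
X = np.stack([utterance_mfcc(f) for f in files])

# Choose the number of emotion clusters with a penalized-likelihood criterion (BIC here,
# standing in for ICL) and read off the cluster assignments.
models = {k: GaussianMixture(n_components=k, random_state=0).fit(X) for k in (2, 3)}
best_k = min(models, key=lambda k: models[k].bic(X))
print(best_k, models[best_k].predict(X))
```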
Summary of the DISPLACE challenge 2023-DIarization of SPeaker and LAnguage in Conversational Environments
IF 3.2 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-05-11 DOI: 10.1016/j.specom.2024.103080
Shikha Baghel, Shreyas Ramoji, Somil Jain, Pratik Roy Chowdhuri, Prachi Singh, Deepu Vijayasenan, Sriram Ganapathy
{"title":"Summary of the DISPLACE challenge 2023-DIarization of SPeaker and LAnguage in Conversational Environments","authors":"Shikha Baghel ,&nbsp;Shreyas Ramoji ,&nbsp;Somil Jain ,&nbsp;Pratik Roy Chowdhuri ,&nbsp;Prachi Singh ,&nbsp;Deepu Vijayasenan ,&nbsp;Sriram Ganapathy","doi":"10.1016/j.specom.2024.103080","DOIUrl":"10.1016/j.specom.2024.103080","url":null,"abstract":"<div><p>In multi-lingual societies, where multiple languages are spoken in a small geographic vicinity, informal conversations often involve mix of languages. Existing speech technologies may be inefficient in extracting information from such conversations, where the speech data is rich in diversity with multiple languages and speakers. The <strong>DISPLACE</strong> (DIarization of SPeaker and LAnguage in Conversational Environments) challenge constitutes an open-call for evaluating and bench-marking the speaker and language diarization technologies on this challenging condition. To facilitate this challenge, a real-world dataset featuring multilingual, multi-speaker conversational far-field speech was recorded and distributed. The challenge entailed two tracks: Track-1 focused on speaker diarization (SD) in multilingual situations while, Track-2 addressed the language diarization (LD) in a multi-speaker scenario. Both the tracks were evaluated using the same underlying audio data. Furthermore, a baseline system was made available for both SD and LD task which mimicked the state-of-art in these tasks. The challenge garnered a total of 42 world-wide registrations and received a total of 19 combined submissions for Track-1 and Track-2. This paper describes the challenge, details of the datasets, tasks, and the baseline system. Additionally, the paper provides a concise overview of the submitted systems in both tracks, with an emphasis given to the top performing systems. The paper also presents insights and future perspectives for SD and LD tasks, focusing on the key challenges that the systems need to overcome before wide-spread commercial deployment on such conversations.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"161 ","pages":"Article 103080"},"PeriodicalIF":3.2,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141054826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0