{"title":"Gradient or categorical? Towards a phonological typology of illusory vowels in Mandarin","authors":"Yizhou Wang , Rikke Bundgaard-Nielsen , Brett Baker , Olga Maxwell","doi":"10.1016/j.specom.2025.103252","DOIUrl":"10.1016/j.specom.2025.103252","url":null,"abstract":"<div><div>This paper argues that illusory vowel perception, i.e., the perception of non-existent vowels between two consonants by nonnative listeners, is gradient rather than categorical in Mandarin Chinese, and that the strength of illusion is predictable from the mismatches between the nonnative speech input and the listeners’ native phonological grammar. We examined five phonological scenarios where illusory vowels with different qualities can be perceived, and different illusion levels can be predicted by factors including syllable phonotactic constraints, vowel minimality, and the place of articulation consistency between the illusory vowel and its preceding consonant. The predictions were examined in an AXB discrimination task (Experiment 1) and an identification task (Experiment 2), which confirmed the predictions overall, while some paradigmatic differences were also observed. By comparing the current results and previous reports, we argue that a gradient rather than categorical account of illusory vowel is more suitable for explaining and predicting nonnative cluster perception. Specifically, the place of articulation feature of the preceding consonant is important for predicting contextual illusory vowels, which reflects nonnative listeners’ interpretation of perceived gestural score across multiple segments, supporting a direct realist view of speech perception.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103252"},"PeriodicalIF":2.4,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144070533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"\"I said simPle, not symBol!\"Is clear speech tailored to the listener's feedback","authors":"Maëva Garnier, Marion Dohen","doi":"10.1016/j.specom.2025.103251","DOIUrl":"10.1016/j.specom.2025.103251","url":null,"abstract":"<div><div>This study investigates variation in the production of French stop consonants in two situations of speech clarity enhancement – when addressing an interlocutor experiencing listening difficulties in a disrupted communication environment (clear speech), and when correcting specific listener misunderstandings (corrected speech). Of interest is whether speech modifications are similar in both situations, or if adjustments during correction specifically address listeners' errors.</div><div>Twelve native French speakers interacted with the experimenter in a gaming task, first in conversational speech ('Conv') under normal conditions, then in clear speech prompted by apparent listening difficulties from the interlocutor ('Clear'). In the disrupted situation, some words were misunderstood by the listener (errors in either voicing or articulation place of stop consonants), resulting in additional corrections by the speaker ('Clear+Corr').</div><div>Significant changes in the timing and spectral cues of stop consonants (closure duration, Voice Onset Time, burst spectrum) were observed in both clear and corrected speech, improving distinctions between voiced and voiceless stops and articulation places. Additionally, clear speech prompted by listening difficulties showed global modifications (overall increased intensity, longer syllable duration, hyper-articulated vowels). Conversely, corrected speech focused solely on segmental modifications, with burst spectrum variations significantly influenced by listener feedback, emphasizing the distinction between the speaker's intended segment and the misunderstood one.</div><div>The results suggest that both situations of speech clarity enhancement involve different strategies, with speech correction relying on real-time perception of the listener's feedback to specifically address perceptual errors.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103251"},"PeriodicalIF":2.4,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144069930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speakers’ communicative intentions lead to acoustic adjustments in native and non-native directed speech","authors":"Giorgio Piazza , Marina Kalashnikova , Laura Fernández-Merino , Clara D. Martin","doi":"10.1016/j.specom.2025.103250","DOIUrl":"10.1016/j.specom.2025.103250","url":null,"abstract":"<div><div>Speakers adapt acoustic features to factors such as listeners’ linguistic profiles. For instance, addressing a non-native listener elicits Non-Native Directed Speech (NNDS). However, whether these speech adaptations vary depending on the speakers’ didactic goals, in interaction with the listeners' profiles (i.e., native vs. non-native), remains unknown.</div><div>We recorded native Spanish speakers naming novel objects to aid their listeners’ performance in comprehension, pronunciation, and writing tasks. Each speaker interacted with a native (Native Directed Speech, NDS) and a non-native (NNDS) Spanish listener. We extracted measures of vowel hyperarticulation, duration, intensity, speech rate, and F0 to assess listener- and task-specific speech adjustments.</div><div>Our results showed that speakers hyperarticulated vowels to a greater extent in the writing condition compared to the comprehension condition, and during NNDS compared to NDS. Listener profile and task also impacted speakers’ F0 height, intensity, and vowel duration production. Therefore, speakers adjust acoustic features in their speech to achieve their didactic goals and accommodate their listener's profile. Also, speakers’ overall greater adaptation in NNDS than in NDS suggests that NNDS serves a didactic purpose.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103250"},"PeriodicalIF":2.4,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144069931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Early identification of bulbar motor dysfunction in ALS: An approach using AFM signal decomposition","authors":"Shaik Mulla Shabber , Mohan Bansal","doi":"10.1016/j.specom.2025.103246","DOIUrl":"10.1016/j.specom.2025.103246","url":null,"abstract":"<div><div>Amyotrophic lateral sclerosis (ALS) is an aggressive neurodegenerative disorder that impacts the nerve cells in the brain and spinal cord that control muscle movements. Early ALS symptoms include speech and swallowing difficulties, and sadly, the disease is incurable and fatal in some instances. This study aims to construct a predictive model for identifying speech dysarthria and bulbar motor dysfunction in ALS patients, using speech signals as a non-invasive biomarker. Utilizing an amplitude and frequency modulated (AFM) signal decomposition model, the study identifies distinctive characteristics crucial for monitoring and diagnosing ALS. The study focuses on classifying ALS patients and healthy controls (HC) through a machine-learning approach, employing the TORGO database for analysis. Recognizing speech signals as potential biomarkers for ALS detection, the study aims to achieve early identification without invasive measures. An ensemble learning classifier attains a remarkable 97% accuracy in distinguishing between ALS and HC based on features extracted using the AFM signal model.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103246"},"PeriodicalIF":2.4,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143929052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An update rule for multiple source variances estimation using microphone arrays","authors":"Fan Zhang , Chao Pan , Jingdong Chen , Jacob Benesty","doi":"10.1016/j.specom.2025.103245","DOIUrl":"10.1016/j.specom.2025.103245","url":null,"abstract":"<div><div>This paper addresses the problem of time-varying variance estimation in scenarios with multiple speech sources and background noise using a microphone array, which is an important issue in speech enhancement. Under the optimal principle of maximum likelihood (ML), the variance estimation under the general cases occurs no explicit formula, and all the variances require to be updated iteratively. Inspired by the fixed-point iteration (FPI) method, we derive an update rule for variance estimation by introducing a dummy term and exploiting the ML condition. Insights into the update rule is investigated and the relationship with the variance estimates under least-squares (LS) principle is presented. Finally, by simulations, we show that the resulting variance update rule is very efficient and effective, which requires only a few iterations to converge, and the estimation error is very close to the Cramér–Rao Bound (CRB).</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103245"},"PeriodicalIF":2.4,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143895313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep learning based stage-wise two-dimensional speaker localization with large ad-hoc microphone arrays","authors":"Shupei Liu , Linfeng Feng , Yijun Gong , Chengdong Liang , Chen Zhang , Xiao-Lei Zhang , Xuelong Li","doi":"10.1016/j.specom.2025.103247","DOIUrl":"10.1016/j.specom.2025.103247","url":null,"abstract":"<div><div>While deep-learning-based speaker localization has shown advantages in challenging acoustic environments, it often yields only direction-of-arrival (DOA) cues rather than precise two-dimensional (2D) coordinates. To address this, we propose a novel deep-learning-based 2D speaker localization method leveraging ad-hoc microphone arrays. Specifically, each ad-hoc array comprises randomly distributed microphone nodes, each of which is equipped with a traditional array. Our approach first employs convolutional neural networks at each node to estimate speaker directions.Then, we integrate these DOA estimates using triangulation and clustering techniques to get 2D speaker locations. To further boost the estimation accuracy, we introduce a node selection algorithm that strategically filters the most reliable nodes. Extensive experiments on both simulated and real-world data demonstrate that our approach significantly outperforms conventional methods. The proposed node selection further refines performance. The real-world dataset in the experiment, named Libri-adhoc-node10 which is a newly recorded data described for the first time in this paper, is online available at <span><span>https://github.com/Liu-sp/Libri-adhoc-nodes10</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103247"},"PeriodicalIF":2.4,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143892370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech Emotion Recognition via CNN-Transformer and multidimensional attention mechanism","authors":"Xiaoyu Tang , Jiazheng Huang , Yixin Lin , Ting Dang , Jintao Cheng","doi":"10.1016/j.specom.2025.103242","DOIUrl":"10.1016/j.specom.2025.103242","url":null,"abstract":"<div><div>Speech Emotion Recognition (SER) is crucial in human–machine interactions. Previous approaches have predominantly focused on local spatial or channel information and neglected the temporal information in speech. In this paper, to model local and global information at different levels of granularity in speech and capture temporal, spatial and channel dependencies in speech signals, we propose a Speech Emotion Recognition network based on CNN-Transformer and multi-dimensional attention mechanisms. Specifically, a stack of CNN blocks is dedicated to capturing local information in speech from a time–frequency perspective. In addition, a time-channel-space attention mechanism is used to enhance features across three dimensions. Moreover, we model local and global dependencies of feature sequences using large convolutional kernels with depthwise separable convolutions and lightweight Transformer modules. We evaluate the proposed method on IEMOCAP and Emo-DB datasets and show our approach significantly improves the performance over the state-of-the-art methods. <span><span>https://github.com/SCNU-RISLAB/CNN-Transforemr-and-Multidimensional-Attention-Mechanism</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103242"},"PeriodicalIF":2.4,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143883032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vibravox: A dataset of french speech captured with body-conduction audio sensors","authors":"Julien Hauret , Malo Olivier , Thomas Joubaud , Christophe Langrenne , Sarah Poirée , Véronique Zimpfer , Éric Bavu","doi":"10.1016/j.specom.2025.103238","DOIUrl":"10.1016/j.specom.2025.103238","url":null,"abstract":"<div><div>Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR) containing audio recordings using five different body-conduction audio sensors: two in-ear microphones, two bone conduction vibration pickups, and a laryngophone. The dataset also includes audio data from an airborne microphone used as a reference. The Vibravox corpus contains 45 h per sensor of speech samples and physiological sounds recorded by 188 participants under different acoustic conditions imposed by a high order ambisonics 3D spatializer. Annotations about the recording conditions and linguistic transcriptions are also included in the corpus. We conducted a series of experiments on various speech-related tasks, including speech recognition, speech enhancement, and speaker verification. These experiments were carried out using state-of-the-art models to evaluate and compare their performances on signals captured by the different audio sensors offered by the Vibravox dataset, with the aim of gaining a better grasp of their individual characteristics.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103238"},"PeriodicalIF":2.4,"publicationDate":"2025-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143892371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lexical, syntactic, semantic and acoustic entrainment in Slovak, Spanish, English, and Hungarian: A cross-linguistic comparison","authors":"Jay Kejriwal , Štefan Beňuš","doi":"10.1016/j.specom.2025.103240","DOIUrl":"10.1016/j.specom.2025.103240","url":null,"abstract":"<div><div>Entrainment is the tendency of speakers to reuse each other’s linguistic material, including lexical, syntactic, semantic, or acoustic–prosodic, during a conversation. While entrainment has been studied in English and other Germanic languages, it is less researched in other language groups. In this study, we evaluated lexical, syntactic, semantic, and acoustic–prosodic entrainment in four comparable spoken corpora of four typologically different languages (English, Slovak, Spanish, and Hungarian) using comparable tools and methodologies based on DNN embeddings. Our cross-linguistic comparison revealed that Hungarian speakers are closer to their interlocutors and more consistent with their own linguistic features when compared to English, Slovak, and Spanish speakers. Further, comparison across different linguistic levels within each language revealed that speakers are closest to their partners and most consistent with their own linguistic features at the acoustic level, followed by semantic, lexical, and syntactic levels. Examining the four languages separately, we found that people’s tendency to be close to each other at each turn (proximity) varies at different linguistic levels in different languages. Additionally, we found that entrainment in lexical, syntactic, semantic, and acoustic–prosodic features are positively correlated in all four datasets. Our results are relevant for the predictions of Interactive Alignment theory (Pickering and Garrod, 2004) and may facilitate implementing entrainment functionality in human–machine interactions (HMI).</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103240"},"PeriodicalIF":2.4,"publicationDate":"2025-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143876675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expectation of speech style improves audio-visual perception of English vowels","authors":"Joan A. Sereno , Allard Jongman , Yue Wang , Paul Tupper , Dawn M. Behne , Jetic Gu , Haoyao Ruan","doi":"10.1016/j.specom.2025.103243","DOIUrl":"10.1016/j.specom.2025.103243","url":null,"abstract":"<div><div>Speech perception is influenced by both signal-internal properties and signal-independent knowledge, including communicative expectations. This study investigates how these two factors interact, focusing on the role of speech style expectations. Specifically, we examine how prior knowledge about speech style (clear versus plain speech) affects word identification and speech style judgment. Native English perceivers were presented with English words containing tense versus lax vowels in either clear or plain speech, with trial conditions manipulating whether style prompts (presented immediately prior to the target word) were congruent or incongruent with the actual speech style. The stimuli were also presented in three input modalities: auditory (speaker voice), visual (speaker face), and audio-visual. Results show that prior knowledge of speech style improved accuracy in identifying style after the session when style information in the prompt and target word was consistent, particularly in auditory and audio-visual modalities. Additionally, as expected, clear speech enhanced word intelligibility compared to plain speech, with benefits more evident for tense vowels and in auditory and audio-visual contexts. These results demonstrate that congruent style prompts improve style identification accuracy by aligning with high-level expectations, while clear speech enhances word identification accuracy due to signal-internal modifications. Overall, the current findings suggest an interplay of processing sources of information which are both signal-driven and signal-independent, and that high-level signal-complementary information such as speech style is not separate from, but is embodied in, the signal.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103243"},"PeriodicalIF":2.4,"publicationDate":"2025-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143855649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}