The Ohio Child Speech Corpus
Laura Wagner, Sharifa Alghowinhem, Abeer Alwan, Kristina Bowdrie, Cynthia Breazeal, Cynthia G. Clopper, Eric Fosler-Lussier, Izabela A. Jamsek, Devan Lander, Rajiv Ramnath, Jory Ross
Speech Communication, Volume 170, Article 103206. DOI: 10.1016/j.specom.2025.103206. Published 2025-03-04.
Abstract: This paper reports on the creation and composition of a new corpus of children's speech, the Ohio Child Speech Corpus, which is publicly available on the TalkBank-CHILDES website. The audio corpus contains speech samples from 303 children ranging in age from 4 to 9 years old, all of whom participated in a seven-task elicitation protocol conducted in a science museum lab. An interactive social robot controlled by the researchers joined the sessions for approximately 60% of the children, and the corpus was collected in the peri-pandemic period. Two analyses highlighting these last two features are reported. The first found that children spoke significantly more in the presence of the robot than in its absence, but the robot's presence had no effect on speech complexity as measured by mean length of utterance (MLU). The second compared children tested immediately after the pandemic with children tested a year later on two school-readiness tasks, an Alphabet task and a Reading Passages task; for this highly educated sample, the children tested just after the pandemic showed no deficit relative to those tested later. These analyses illustrate just two of the questions that the corpus can be used to investigate.
Phonetic realizations of focus in declarative intonation in Iraqi Arabic
Muhammad Swaileh A. Alzaidi, Saaed A. Saaed, Mohammad A.S. Bani Younes
Speech Communication, Volume 170, Article 103203. DOI: 10.1016/j.specom.2025.103203. Published 2025-02-24.
Abstract: This study investigates how information focus and contrastive focus are prosodically realized in Iraqi Arabic, contributing to the ongoing debate on focus-marking across languages and within Arabic dialects. Using a question-answer paradigm, we elicited information focus, contrastive focus and neutral focus in sentence-initial, sentence-penultimate and sentence-final positions. Systematic analyses examined continuous f0 trajectories together with specific acoustic measurements, including maximum f0, mean f0, minimum f0, excursion size, intensity and duration. The results reveal that prosodic patterns are significantly influenced by the type of focus (information vs. contrastive) and its position within the sentence. Both information and contrastive focus lead to distinct prosodic patterns compared with neutral focus, with specific features being more sensitive to focus type depending on sentential position. In particular, contrastive focus tends to have stronger intensity than information focus, especially sentence-finally. Additionally, the presence of focus, especially when sentence-initial, significantly reduces the pitch (mean f0 and minimum f0) and intensity of post-focus words, with contrastive focus lowering the minimum f0 of subsequent words more than information focus does. The findings further indicate that sentence-penultimate focus generally reduces the f0 and duration of pre-focus words more comprehensively, while contrastive focus exerts a stronger influence on f0 reduction in sentence-final positions. These results (a) underscore the nuanced role of focus in shaping the prosodic structure of sentences and (b) demonstrate that post-focus compression (PFC) occurs in Iraqi Arabic, making it similar to Egyptian, Emirati, Hijazi, Jizani, Lebanese and Najdi Arabic but different from Makkan Arabic. These results have implications for prosodic typology.
Non-intrusive binaural speech recognition prediction for hearing aid processing
Jana Roßbach, Nils L. Westhausen, Hendrik Kayser, Bernd T. Meyer
Speech Communication, Volume 170, Article 103202. DOI: 10.1016/j.specom.2025.103202. Published 2025-02-19.
Abstract: Hearing aids (HAs) often feature different signal processing algorithms to optimize speech recognition (SR) in a given acoustic environment. In this paper, we explore whether models that predict the SR performance of hearing-impaired (HI), aided users can be used to automatically select the best algorithm. To this end, SR experiments are conducted with 19 HI subjects aided with an open-source HA. Listeners' SR is measured in virtual, complex acoustic scenes with two distinct noise conditions, using the different speech enhancement strategies implemented in this HA. For model-based selection, we apply a PHOneme-based Binaural Intelligibility model (PHOBI) based on our previous work and extended with a component for simulating hearing loss. The non-intrusive model uses a deep neural network to predict phone probabilities; the deterioration of these phone representations under noise or other signal degradation is quantified and used as the model output. PHOBI is trained with 960 h of English speech, a broad range of noise signals and room impulse responses. The performance of model-based algorithm selection is measured with two metrics: (i) the ability to rank the HA algorithms in the order of the subjective SR results, and (ii) the SR difference between the measured best algorithm and the model-based selection (ΔSR). Results are compared with selections obtained from one non-intrusive and two intrusive models. PHOBI outperforms the non-intrusive model and one of the intrusive models in both noise conditions, achieving significantly higher correlations (r = 0.63 and 0.80). ΔSR scores are significantly lower (better) than the non-intrusive baseline (3.5% and 4.6% against 8.6% and 9.8%, respectively). The ΔSR results for PHOBI and the intrusive models are statistically indistinguishable, although PHOBI operates on the observed signal alone and does not require a clean reference signal.
{"title":"Nasal coarticulation in Lombard speech","authors":"Justin J.H. Lo","doi":"10.1016/j.specom.2025.103205","DOIUrl":"10.1016/j.specom.2025.103205","url":null,"abstract":"<div><div>Speaking in noisy environments entails a multitude of adaptations to speech production. Such modifications are expected to reduce gestural overlap between neighbouring sounds in order to enhance their distinctiveness, yet evidence for reduced coarticulation has been ambiguous. Nasal coarticulation in particular presents an unusual case, as it has been suggested to increase instead in certain clear speech conditions. The current study presents an experiment aimed at investigating how use of nasal coarticulation varies in quiet and noisy speech conditions. Speakers of Southern British English were recorded using a nasometer in an interactive reading task and produced monosyllabic target words with vowels bound by combinations of stop and nasal consonants. Use of nasal coarticulation was compared by means of a normalised measure that takes into account the speaker- and vowel-specific range of nasalisation available in each condition. In two noisy conditions where the interlocutor was either visible or not, vowel nasality in coarticulatory contexts was found to decrease in a way that closely tracked the compressed range between oral and nasal baselines. Speakers thus maintained their use of nasal coarticulation in Lombard speech, especially in the anticipatory direction. These findings suggest that the spreading of the velum lowering gesture from nasal consonants to neighbouring vowels is not targeted for adaptation in Lombard speech. They further reaffirm that enhancing acoustic distinctiveness and maintaining coarticulation are joint, compatible goals in the production of hyperarticulated speech.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"169 ","pages":"Article 103205"},"PeriodicalIF":2.4,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143454777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vocal emotion perception in Mandarin-speaking older adults with hearing loss
Yingyang Wang, Min Xu, Jing Shao, Jiaqiang Zhu, Yike Yang, Nan Yan, Lan Wang, Yongjie Zhou
Speech Communication, Volume 169, Article 103204. DOI: 10.1016/j.specom.2025.103204. Published 2025-02-14.
Abstract: Difficulties in older adults' comprehension of vocal emotion have been documented, but limited research has investigated the combined effects of aging and age-related hearing loss. The present study aimed to bridge this gap by comparing three participant groups (younger adults with normal hearing, older adults with hearing loss, and older adults without hearing loss) on the identification of "happy" and "sad" emotions conveyed via prosodic and semantic channels. Regression models were used to investigate the relationship between age, hearing threshold, cognitive abilities and overall emotion perception performance. Identification accuracy showed that older adults with hearing loss performed worse than the two normal-hearing groups in both channels. In addition, only older adults with hearing loss showed lower accuracy for emotional prosody than for semantics, indicating that only this group was affected by channel. For response time, both older listener groups responded more slowly than younger listeners in both channels. They also responded faster to "happy" than to "sad", supporting a positivity effect on emotion perception in older participants. Moreover, the regression models indicated that age, hearing threshold and working memory (measured by the Digit Span test) predicted participants' overall identification accuracy, while selective attention (measured by the Stroop test) predicted participants' overall reaction time. These findings suggest that degraded emotion perception among older adults arises from complex underlying mechanisms, reflecting not only aging but also declines in hearing sensitivity and cognitive function.
{"title":"APIN: Amplitude- and phase-aware interaction network for speech emotion recognition","authors":"Lili Guo , Jie Li , Shifei Ding , Jianwu Dang","doi":"10.1016/j.specom.2025.103201","DOIUrl":"10.1016/j.specom.2025.103201","url":null,"abstract":"<div><div>Speech emotion recognition (SER) occupies a critical position in human-computer interaction and has garnered significant attention from many researchers. A common approach in SER is using deep networks to process acoustic features. Complete acoustic features are made up of amplitude and phase information. However, the majority of existing methods concentrate on amplitude information and a few studies have initially considered phase information, discarding phase information will result in the loss of some emotional information. To fully utilize the complementarity of amplitude and phase information, this paper proposes the amplitude- and phase-aware interaction network (APIN) for SER. The proposed APIN comprises two main modules, i.e., amplitude-phase (A-P) interaction with transformer and gated fusion. Especially, the A-P interaction module enables amplitude and phase to guide and complement each other when learning emotional representations. Subsequently, the adaptive gated module is introduced to further fuse amplitude representation and phase representation. Finally, experiments were conducted on two benchmark datasets including the EmoDB and IEMOCAP. Extensive experiments demonstrate that the proposed APIN outperforms traditional methods that rely solely on amplitude information or use both amplitude and phase information as well as several state-of-the-art approaches.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"169 ","pages":"Article 103201"},"PeriodicalIF":2.4,"publicationDate":"2025-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143378874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comprehensive study on supervised single-channel noisy speech separation with multi-task learning","authors":"Shaoxiang Dang , Tetsuya Matsumoto , Yoshinori Takeuchi , Hiroaki Kudo","doi":"10.1016/j.specom.2024.103162","DOIUrl":"10.1016/j.specom.2024.103162","url":null,"abstract":"<div><div>This research presents a comprehensive investigation and comparison of noisy speech separation methods using multi-task learning. First, we categorize all methods into two pipelines: enhancement priority pipeline (EPP) and separation priority pipeline (SPP), based on whether prioritizing enhancement or separation. Next, we classify each pipeline into shared encoder–decoder scheme (SEDS) and independent encoder–decoder scheme (IEDS), depending on whether the two modules share the same encoder and decoder. Additionally, we introduce two types of intermediate structures between the two modules. One structure uses time–frequency (T–F) representations, while the other uses T–F masks. This article elaborates on the strengths and weaknesses of each approach, particularly in mitigating over-suppression and improving computational efficiency. Our experiments show substantial improvements in SPP with IEDS across multiple metrics on the LibriXmix dataset. In addition, by replacing the synthesis-based trick in the enhancement module, the model achieves superior generalization on the LibriCSS dataset.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103162"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An overview of high-resource automatic speech recognition methods and their empirical evaluation in low-resource environments","authors":"Kavan Fatehi , Mercedes Torres Torres , Ayse Kucukyilmaz","doi":"10.1016/j.specom.2024.103151","DOIUrl":"10.1016/j.specom.2024.103151","url":null,"abstract":"<div><div>Deep learning methods for Automatic Speech Recognition (ASR) often rely on large-scale training datasets, which are typically unavailable in low-resource environments (LREs). This lack of sufficient and representative training data poses a significant challenge for applying ASR systems in specific domains categorized as LREs. In this paper, we provide a comprehensive overview and empirical analysis of state-of-the-art deep learning techniques for ASR, which are primarily designed for high-resource environments (HREs). Our aim is to explore their potential effectiveness in LRE settings. We focus on identifying key factors that influence the adaptation of HRE models to LRE tasks. To this end, we survey advanced deep learning models and conduct a comparative evaluation of their performance in LRE contexts. Additionally, we propose that pre-training ASR models on HRE datasets, followed by domain-specific fine-tuning on LRE data, can significantly enhance performance in data-scarce settings. Using LibriSpeech and WSJ as our HRE datasets, we evaluate these models on two LRE datasets: UASpeech for dysarthria speech and iCUBE, our novel human–robot interaction dataset. Our systematic experiments, involving varying dataset sizes for pre-training, demonstrate the efficacy of combining pre-training and fine-tuning strategies to improve recognition accuracy in LREs.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103151"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A model of early word acquisition based on realistic-scale audiovisual naming events","authors":"Khazar Khorrami, Okko Räsänen","doi":"10.1016/j.specom.2024.103169","DOIUrl":"10.1016/j.specom.2024.103169","url":null,"abstract":"<div><div>Infants gradually learn to parse continuous speech into words and connect names with objects, yet the mechanisms behind development of early word perception skills remain unknown. We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input. We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that solely learns from statistical regularities in unannotated raw speech and pixel-level visual input. Crucially, the quantity of object naming events was carefully designed to match that accessible to infants of comparable ages. Results show that the model effectively learns to recognize words and associate them with corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants. The findings support the viability of general statistical learning for early word perception, demonstrating how learning can operate without assuming any prior linguistic capabilities.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103169"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143128396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HC-APNet: Harmonic Compensation Auditory Perception Network for low-complexity speech enhancement
Nan Li, Meng Ge, Longbiao Wang, Yang-Hao Zhou, Jianwu Dang
Speech Communication, Volume 167, Article 103161. DOI: 10.1016/j.specom.2024.103161. Published 2025-02-01.
Abstract: Speech enhancement is critical for improving speech quality and intelligibility in a variety of noisy environments. While neural network-based methods have shown promising results, they often suffer performance degradation when computational resources are limited. This paper presents HC-APNet (Harmonic Compensation Auditory Perception Network), a lightweight approach tailored to exploit the perceptual capabilities of the human auditory system for efficient and effective speech enhancement, with a focus on harmonic compensation. Inspired by human auditory reception mechanisms, we first segment audio into subbands using an auditory filterbank. Working on subbands reduces the number of parameters and the computational load, while the auditory filterbank preserves high speech quality. In addition, inspired by human perception of auditory context, we develop an auditory perception network to estimate gains for the different subbands. Furthermore, because subband processing applies gain only to the spectral envelope, which can introduce harmonic distortion, we design a learnable multi-subband comb filter, inspired by human pitch perception, to mitigate this distortion. On the VCTK+DEMAND and DNS Challenge datasets, HC-APNet achieves competitive speech quality with significantly lower computational cost and fewer parameters than existing methods.