{"title":"Gradient or categorical? Towards a phonological typology of illusory vowels in Mandarin","authors":"Yizhou Wang , Rikke Bundgaard-Nielsen , Brett Baker , Olga Maxwell","doi":"10.1016/j.specom.2025.103252","DOIUrl":"10.1016/j.specom.2025.103252","url":null,"abstract":"<div><div>This paper argues that illusory vowel perception, i.e., the perception of non-existent vowels between two consonants by nonnative listeners, is gradient rather than categorical in Mandarin Chinese, and that the strength of illusion is predictable from the mismatches between the nonnative speech input and the listeners’ native phonological grammar. We examined five phonological scenarios where illusory vowels with different qualities can be perceived, and different illusion levels can be predicted by factors including syllable phonotactic constraints, vowel minimality, and the place of articulation consistency between the illusory vowel and its preceding consonant. The predictions were examined in an AXB discrimination task (Experiment 1) and an identification task (Experiment 2), which confirmed the predictions overall, while some paradigmatic differences were also observed. By comparing the current results and previous reports, we argue that a gradient rather than categorical account of illusory vowel is more suitable for explaining and predicting nonnative cluster perception. Specifically, the place of articulation feature of the preceding consonant is important for predicting contextual illusory vowels, which reflects nonnative listeners’ interpretation of perceived gestural score across multiple segments, supporting a direct realist view of speech perception.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103252"},"PeriodicalIF":2.4,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144070533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"\"I said simPle, not symBol!\"Is clear speech tailored to the listener's feedback","authors":"Maëva Garnier, Marion Dohen","doi":"10.1016/j.specom.2025.103251","DOIUrl":"10.1016/j.specom.2025.103251","url":null,"abstract":"<div><div>This study investigates variation in the production of French stop consonants in two situations of speech clarity enhancement – when addressing an interlocutor experiencing listening difficulties in a disrupted communication environment (clear speech), and when correcting specific listener misunderstandings (corrected speech). Of interest is whether speech modifications are similar in both situations, or if adjustments during correction specifically address listeners' errors.</div><div>Twelve native French speakers interacted with the experimenter in a gaming task, first in conversational speech ('Conv') under normal conditions, then in clear speech prompted by apparent listening difficulties from the interlocutor ('Clear'). In the disrupted situation, some words were misunderstood by the listener (errors in either voicing or articulation place of stop consonants), resulting in additional corrections by the speaker ('Clear+Corr').</div><div>Significant changes in the timing and spectral cues of stop consonants (closure duration, Voice Onset Time, burst spectrum) were observed in both clear and corrected speech, improving distinctions between voiced and voiceless stops and articulation places. Additionally, clear speech prompted by listening difficulties showed global modifications (overall increased intensity, longer syllable duration, hyper-articulated vowels). Conversely, corrected speech focused solely on segmental modifications, with burst spectrum variations significantly influenced by listener feedback, emphasizing the distinction between the speaker's intended segment and the misunderstood one.</div><div>The results suggest that both situations of speech clarity enhancement involve different strategies, with speech correction relying on real-time perception of the listener's feedback to specifically address perceptual errors.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103251"},"PeriodicalIF":2.4,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144069930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speakers’ communicative intentions lead to acoustic adjustments in native and non-native directed speech","authors":"Giorgio Piazza , Marina Kalashnikova , Laura Fernández-Merino , Clara D. Martin","doi":"10.1016/j.specom.2025.103250","DOIUrl":"10.1016/j.specom.2025.103250","url":null,"abstract":"<div><div>Speakers adapt acoustic features to factors such as listeners’ linguistic profiles. For instance, addressing a non-native listener elicits Non-Native Directed Speech (NNDS). However, whether these speech adaptations vary depending on the speakers’ didactic goals, in interaction with the listeners' profiles (i.e., native vs. non-native), remains unknown.</div><div>We recorded native Spanish speakers naming novel objects to aid their listeners’ performance in comprehension, pronunciation, and writing tasks. Each speaker interacted with a native (Native Directed Speech, NDS) and a non-native (NNDS) Spanish listener. We extracted measures of vowel hyperarticulation, duration, intensity, speech rate, and F0 to assess listener- and task-specific speech adjustments.</div><div>Our results showed that speakers hyperarticulated vowels to a greater extent in the writing condition compared to the comprehension condition, and during NNDS compared to NDS. Listener profile and task also impacted speakers’ F0 height, intensity, and vowel duration production. Therefore, speakers adjust acoustic features in their speech to achieve their didactic goals and accommodate their listener's profile. Also, speakers’ overall greater adaptation in NNDS than in NDS suggests that NNDS serves a didactic purpose.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103250"},"PeriodicalIF":2.4,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144069931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Early identification of bulbar motor dysfunction in ALS: An approach using AFM signal decomposition","authors":"Shaik Mulla Shabber , Mohan Bansal","doi":"10.1016/j.specom.2025.103246","DOIUrl":"10.1016/j.specom.2025.103246","url":null,"abstract":"<div><div>Amyotrophic lateral sclerosis (ALS) is an aggressive neurodegenerative disorder that impacts the nerve cells in the brain and spinal cord that control muscle movements. Early ALS symptoms include speech and swallowing difficulties, and sadly, the disease is incurable and fatal in some instances. This study aims to construct a predictive model for identifying speech dysarthria and bulbar motor dysfunction in ALS patients, using speech signals as a non-invasive biomarker. Utilizing an amplitude and frequency modulated (AFM) signal decomposition model, the study identifies distinctive characteristics crucial for monitoring and diagnosing ALS. The study focuses on classifying ALS patients and healthy controls (HC) through a machine-learning approach, employing the TORGO database for analysis. Recognizing speech signals as potential biomarkers for ALS detection, the study aims to achieve early identification without invasive measures. An ensemble learning classifier attains a remarkable 97% accuracy in distinguishing between ALS and HC based on features extracted using the AFM signal model.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103246"},"PeriodicalIF":2.4,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143929052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An update rule for multiple source variances estimation using microphone arrays","authors":"Fan Zhang , Chao Pan , Jingdong Chen , Jacob Benesty","doi":"10.1016/j.specom.2025.103245","DOIUrl":"10.1016/j.specom.2025.103245","url":null,"abstract":"<div><div>This paper addresses the problem of time-varying variance estimation in scenarios with multiple speech sources and background noise using a microphone array, which is an important issue in speech enhancement. Under the optimal principle of maximum likelihood (ML), the variance estimation under the general cases occurs no explicit formula, and all the variances require to be updated iteratively. Inspired by the fixed-point iteration (FPI) method, we derive an update rule for variance estimation by introducing a dummy term and exploiting the ML condition. Insights into the update rule is investigated and the relationship with the variance estimates under least-squares (LS) principle is presented. Finally, by simulations, we show that the resulting variance update rule is very efficient and effective, which requires only a few iterations to converge, and the estimation error is very close to the Cramér–Rao Bound (CRB).</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103245"},"PeriodicalIF":2.4,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143895313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep learning based stage-wise two-dimensional speaker localization with large ad-hoc microphone arrays","authors":"Shupei Liu , Linfeng Feng , Yijun Gong , Chengdong Liang , Chen Zhang , Xiao-Lei Zhang , Xuelong Li","doi":"10.1016/j.specom.2025.103247","DOIUrl":"10.1016/j.specom.2025.103247","url":null,"abstract":"<div><div>While deep-learning-based speaker localization has shown advantages in challenging acoustic environments, it often yields only direction-of-arrival (DOA) cues rather than precise two-dimensional (2D) coordinates. To address this, we propose a novel deep-learning-based 2D speaker localization method leveraging ad-hoc microphone arrays. Specifically, each ad-hoc array comprises randomly distributed microphone nodes, each of which is equipped with a traditional array. Our approach first employs convolutional neural networks at each node to estimate speaker directions.Then, we integrate these DOA estimates using triangulation and clustering techniques to get 2D speaker locations. To further boost the estimation accuracy, we introduce a node selection algorithm that strategically filters the most reliable nodes. Extensive experiments on both simulated and real-world data demonstrate that our approach significantly outperforms conventional methods. The proposed node selection further refines performance. The real-world dataset in the experiment, named Libri-adhoc-node10 which is a newly recorded data described for the first time in this paper, is online available at <span><span>https://github.com/Liu-sp/Libri-adhoc-nodes10</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103247"},"PeriodicalIF":2.4,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143892370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech Emotion Recognition via CNN-Transformer and multidimensional attention mechanism","authors":"Xiaoyu Tang , Jiazheng Huang , Yixin Lin , Ting Dang , Jintao Cheng","doi":"10.1016/j.specom.2025.103242","DOIUrl":"10.1016/j.specom.2025.103242","url":null,"abstract":"<div><div>Speech Emotion Recognition (SER) is crucial in human–machine interactions. Previous approaches have predominantly focused on local spatial or channel information and neglected the temporal information in speech. In this paper, to model local and global information at different levels of granularity in speech and capture temporal, spatial and channel dependencies in speech signals, we propose a Speech Emotion Recognition network based on CNN-Transformer and multi-dimensional attention mechanisms. Specifically, a stack of CNN blocks is dedicated to capturing local information in speech from a time–frequency perspective. In addition, a time-channel-space attention mechanism is used to enhance features across three dimensions. Moreover, we model local and global dependencies of feature sequences using large convolutional kernels with depthwise separable convolutions and lightweight Transformer modules. We evaluate the proposed method on IEMOCAP and Emo-DB datasets and show our approach significantly improves the performance over the state-of-the-art methods. <span><span>https://github.com/SCNU-RISLAB/CNN-Transforemr-and-Multidimensional-Attention-Mechanism</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103242"},"PeriodicalIF":2.4,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143883032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vibravox: A dataset of french speech captured with body-conduction audio sensors","authors":"Julien Hauret , Malo Olivier , Thomas Joubaud , Christophe Langrenne , Sarah Poirée , Véronique Zimpfer , Éric Bavu","doi":"10.1016/j.specom.2025.103238","DOIUrl":"10.1016/j.specom.2025.103238","url":null,"abstract":"<div><div>Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR) containing audio recordings using five different body-conduction audio sensors: two in-ear microphones, two bone conduction vibration pickups, and a laryngophone. The dataset also includes audio data from an airborne microphone used as a reference. The Vibravox corpus contains 45 h per sensor of speech samples and physiological sounds recorded by 188 participants under different acoustic conditions imposed by a high order ambisonics 3D spatializer. Annotations about the recording conditions and linguistic transcriptions are also included in the corpus. We conducted a series of experiments on various speech-related tasks, including speech recognition, speech enhancement, and speaker verification. These experiments were carried out using state-of-the-art models to evaluate and compare their performances on signals captured by the different audio sensors offered by the Vibravox dataset, with the aim of gaining a better grasp of their individual characteristics.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"172 ","pages":"Article 103238"},"PeriodicalIF":2.4,"publicationDate":"2025-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143892371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lexical, syntactic, semantic and acoustic entrainment in Slovak, Spanish, English, and Hungarian: A cross-linguistic comparison","authors":"Jay Kejriwal , Štefan Beňuš","doi":"10.1016/j.specom.2025.103240","DOIUrl":"10.1016/j.specom.2025.103240","url":null,"abstract":"<div><div>Entrainment is the tendency of speakers to reuse each other’s linguistic material, including lexical, syntactic, semantic, or acoustic–prosodic, during a conversation. While entrainment has been studied in English and other Germanic languages, it is less researched in other language groups. In this study, we evaluated lexical, syntactic, semantic, and acoustic–prosodic entrainment in four comparable spoken corpora of four typologically different languages (English, Slovak, Spanish, and Hungarian) using comparable tools and methodologies based on DNN embeddings. Our cross-linguistic comparison revealed that Hungarian speakers are closer to their interlocutors and more consistent with their own linguistic features when compared to English, Slovak, and Spanish speakers. Further, comparison across different linguistic levels within each language revealed that speakers are closest to their partners and most consistent with their own linguistic features at the acoustic level, followed by semantic, lexical, and syntactic levels. Examining the four languages separately, we found that people’s tendency to be close to each other at each turn (proximity) varies at different linguistic levels in different languages. Additionally, we found that entrainment in lexical, syntactic, semantic, and acoustic–prosodic features are positively correlated in all four datasets. Our results are relevant for the predictions of Interactive Alignment theory (Pickering and Garrod, 2004) and may facilitate implementing entrainment functionality in human–machine interactions (HMI).</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103240"},"PeriodicalIF":2.4,"publicationDate":"2025-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143876675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expectation of speech style improves audio-visual perception of English vowels","authors":"Joan A. Sereno , Allard Jongman , Yue Wang , Paul Tupper , Dawn M. Behne , Jetic Gu , Haoyao Ruan","doi":"10.1016/j.specom.2025.103243","DOIUrl":"10.1016/j.specom.2025.103243","url":null,"abstract":"<div><div>Speech perception is influenced by both signal-internal properties and signal-independent knowledge, including communicative expectations. This study investigates how these two factors interact, focusing on the role of speech style expectations. Specifically, we examine how prior knowledge about speech style (clear versus plain speech) affects word identification and speech style judgment. Native English perceivers were presented with English words containing tense versus lax vowels in either clear or plain speech, with trial conditions manipulating whether style prompts (presented immediately prior to the target word) were congruent or incongruent with the actual speech style. The stimuli were also presented in three input modalities: auditory (speaker voice), visual (speaker face), and audio-visual. Results show that prior knowledge of speech style improved accuracy in identifying style after the session when style information in the prompt and target word was consistent, particularly in auditory and audio-visual modalities. Additionally, as expected, clear speech enhanced word intelligibility compared to plain speech, with benefits more evident for tense vowels and in auditory and audio-visual contexts. These results demonstrate that congruent style prompts improve style identification accuracy by aligning with high-level expectations, while clear speech enhances word identification accuracy due to signal-internal modifications. Overall, the current findings suggest an interplay of processing sources of information which are both signal-driven and signal-independent, and that high-level signal-complementary information such as speech style is not separate from, but is embodied in, the signal.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"171 ","pages":"Article 103243"},"PeriodicalIF":2.4,"publicationDate":"2025-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143855649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}