Speech Communication: Latest Articles

Prosodic modulation of discourse markers: A cross-linguistic analysis of conversational dynamics
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication · Pub Date: 2025-06-21 · DOI: 10.1016/j.specom.2025.103271
Yi Shan
Abstract: This paper delves into the world of prosody and pragmatics in discourse markers (DMs). We have come a long way since the early structural approaches, and we are now exploring dynamic models that reveal how prosody shapes DM interpretation in spoken discourse. Our survey covers a range of research methods, from acoustic analysis to naturalistic observation, each offering unique insights into how intonation, stress, and rhythm interact with DMs to guide conversations. Recent cross-linguistic studies, such as Ahn et al. (2024) on Korean "nay mali" and Wang et al. (2024) on Mandarin "haole," demonstrate how prosodic detachment and contextual cues facilitate the evolution of DMs from lexical to pragmatic functions, underscoring the interplay between prosody and discourse management. Further cross-linguistic evidence comes from Vercher's (2023) analysis of Spanish "entonces" and Siebold's (2021) study on German "dann," which highlight language-specific prosodic realizations of DMs in turn management and conversational closings. We also examine cross-linguistic patterns to uncover both universal trends and language-specific characteristics, noting the crucial role cultural context plays in prosodic analysis. In addition, machine learning and AI are transforming the field, allowing prosodic features to be analyzed in massive datasets with unprecedented precision, and multimodal analysis now combines prosody with non-verbal cues for a more holistic understanding of DMs in face-to-face communication. These findings have real-world applications, from improving speech recognition to enhancing language teaching methods. Looking ahead, we advocate an integrated approach that considers the dynamic interplay between prosody, pragmatics, and social context; there is still much to explore across linguistic boundaries and diverse communicative settings. This review is not just a state-of-the-art overview but a roadmap for future research in this field. (Speech Communication, vol. 173, Article 103271.)
Citations: 0
Automatic speech recognition technology to evaluate an audiometric word recognition test: A preliminary investigation
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication · Pub Date: 2025-06-20 · DOI: 10.1016/j.specom.2025.103270
Ayden M. Cauchi, Jaina Negandhi, Sharon L. Cushing, Karen A. Gordon
Abstract: This study investigated the ability of machine learning systems to score a clinical speech perception test in which monosyllabic words are heard and repeated by a listener. The resulting accuracy score is used in audiometric assessments, including cochlear implant candidacy and monitoring. Scoring is performed by clinicians who listen to and judge responses, which can create inter-rater variability and takes clinical time. A machine learning approach could support this testing with increased reliability and time efficiency, particularly in children. The study focused on the Phonetically Balanced Kindergarten (PBK) word list. Spoken responses (n = 1200) were recorded from 12 adults with normal hearing. These words were presented to 3 automatic speech recognizers (Whisper large, Whisper medium, Ursa) and 7 humans in 7 conditions: unaltered or, to simulate potential speech errors, altered by first or last consonant deletion or by low-pass filtering at 1, 2, 4, and 6 kHz (n = 6972 altered responses). Responses were scored as the same as or different from the unaltered target. Automatic speech recognizers (ASRs) classified unaltered words similarly to human evaluators across conditions [mean ± 1 SE: Whisper large = 88.20% ± 1.52%; Whisper medium = 81.20% ± 1.52%; Ursa = 90.70% ± 1.52%; humans = 91.80% ± 2.16%], [F(3, 3866.2) = 23.63, p < 0.001]. Classifications differing from the unaltered target occurred most frequently in the first-consonant-deletion and 1 kHz filtering conditions. Fleiss' kappa showed that ASRs displayed higher agreement than human evaluators on both unaltered (ASRs = 0.69; humans = 0.17) and altered (ASRs = 0.56; humans = 0.51) PBK words. These results support the further development of automatic speech recognition systems to support speech perception testing. (Speech Communication, vol. 173, Article 103270.)
Citations: 0
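The inter-rater agreement figures quoted above are Fleiss' kappa, which extends Cohen's kappa to any fixed number of raters per item. The paper's own scoring pipeline is not reproduced here; as a minimal illustration of the metric only (function name and toy counts are ours, not the authors'):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a subjects-by-categories count matrix.
    ratings[i][j] = number of raters assigning subject i to category j;
    every subject is assumed to be rated by the same number of raters."""
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    # Observed per-subject agreement P_i, then its mean P_bar
    P = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
         for row in ratings]
    P_bar = sum(P) / n_subjects
    # Chance agreement P_e from the marginal category proportions
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    props = [t / (n_subjects * n_raters) for t in totals]
    P_e = sum(p * p for p in props)
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement among 3 raters on 3 items -> kappa = 1
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))  # 1.0
```

A library implementation such as `statsmodels.stats.inter_rater.fleiss_kappa` gives the same statistic from the same count matrix.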
Speech stimulus continuum synthesis using deep learning methods
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication · Pub Date: 2025-06-17 · DOI: 10.1016/j.specom.2025.103266
Zhu Li, Yuqing Zhang, Yanlu Xie
Abstract: Creating a naturalistic speech stimulus continuum (i.e., a series of stimuli equally spaced along a specific acoustic dimension between two given categories) is an indispensable component of categorical perception studies. A common method is to manually modify the key acoustic parameter of speech sounds, yet the quality of such synthetic speech remains unsatisfying. This work explores how deep learning techniques can be used for speech stimulus continuum synthesis, with the aim of improving the naturalness of the synthesized continuum. Drawing on recent advances in speech disentanglement learning, we implement a supervised disentanglement framework based on adversarial training (AT) to separate a specific acoustic feature (e.g., fundamental frequency, formant features) from the other content of the speech signal, achieving controllable stimulus generation by sampling from the latent space of the key acoustic feature. In addition, drawing on the idea of mutual information (MI) from information theory, we design an unsupervised MI-based disentanglement framework to disentangle the specific acoustic feature from the other content. Experiments on stimulus generation for several continua validate the effectiveness of the proposed method in both objective and subjective evaluations. (Speech Communication, vol. 173, Article 103266.)
Citations: 0
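The deep-learning synthesis itself cannot be reconstructed from the abstract, but the defining property of a stimulus continuum (stimuli equally spaced along one acoustic dimension between two category endpoints) can be sketched in a few lines; the f0 endpoints and step count below are illustrative, not from the paper:

```python
def make_continuum(start, end, n_steps):
    """Return n_steps stimulus values equally spaced between two
    category endpoints (both endpoints included), e.g. f0 in Hz."""
    step = (end - start) / (n_steps - 1)
    return [start + i * step for i in range(n_steps)]

# A 5-step f0 continuum between a 100 Hz and a 200 Hz category
print(make_continuum(100.0, 200.0, 5))  # [100.0, 125.0, 150.0, 175.0, 200.0]
```

In an actual experiment each value would parameterize a resynthesized speech token; the contribution of the paper is generating those tokens naturally by sampling the latent space of the disentangled feature rather than by manual parameter editing.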
The perception of intonational peaks and valleys: The effects of plateaux, declination and experimental task
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication · Pub Date: 2025-06-10 · DOI: 10.1016/j.specom.2025.103267
Hae-Sung Jeon
Abstract: An experiment assessed listeners' judgement of either relative pitch height or prominence between two consecutive fundamental frequency (f_o) peaks or valleys in speech. The f_o contour of the first peak or valley was kept constant, while the second was orthogonally manipulated in height and plateau duration. Half of the stimuli had a flat baseline from which the peaks and valleys were scaled; the other half had an overtly declining baseline. The results replicated the previous finding that f_o peaks with a long plateau are salient to listeners, while valleys are hard to process even with a plateau. Furthermore, the effect of declination depended on the experimental task. Listeners' responses seemed to be directly affected by f_o excursion size only when judging relative height between two peaks, while their prominence judgements were strongly affected by the overall impression of the pitch-raising or pitch-lowering event near the perceptual target. The findings suggest that the global f_o contour, not a single representative f_o value of an intonational event, should be considered in perceptual models of intonation, and they show an interplay between the signal, listeners' top-down expectations, and speech perception. (Speech Communication, vol. 173, Article 103267.)
Citations: 0
A feature engineering approach for literary and colloquial Tamil speech classification using 1D-CNN
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication · Pub Date: 2025-05-29 · DOI: 10.1016/j.specom.2025.103254
M. Nanmalar, S. Johanan Joysingh, P. Vijayalakshmi, T. Nagarajan
Abstract: In ideal human-computer interaction (HCI), most users would prefer the colloquial form of a language, since it is the form used in their day-to-day conversations. However, there is also an undeniable necessity to preserve the formal literary form. By embracing the new and preserving the old, both service to the common man (practicality) and service to the language itself (conservation) can be rendered. Hence, it is ideal for computers to be able to accept, process, and converse in both forms of the language, as required. To address this, it is first necessary to identify the form of the input speech, which in the current work means distinguishing literary from colloquial Tamil. Such a front-end system requires a simple, effective, and lightweight classifier trained on a few effective features that capture the underlying patterns of the speech signal. To this end, a one-dimensional convolutional neural network (1D-CNN) that learns the envelope of features across time is proposed. The network is trained first on a select number of handcrafted features and then on Mel-frequency cepstral coefficients (MFCC) for comparison. The handcrafted features were selected to address various aspects of speech, such as spectral and temporal characteristics, prosody, and voice quality, and were first analyzed by considering ten parallel utterances and observing each feature's trend over time. The proposed 1D-CNN trained on the handcrafted features offers an F1 score of 0.9803, while the MFCC-trained network offers 0.9895. In light of this, feature ablation and feature combination are explored: when the best-ranked handcrafted features from the ablation study are combined with the MFCC, they offer the best results, with an F1 score of 0.9946. (Speech Communication, vol. 173, Article 103254.)
Citations: 0
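For reference, the F1 scores quoted above are the harmonic mean of precision and recall. A minimal sketch from binary confusion counts (function name and example counts are ours):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, computed from
    true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 90 correct detections, 10 false alarms, 10 misses -> F1 = 0.9
print(f1_score(90, 10, 10))  # 0.9
```

Because it ignores true negatives, F1 is a common choice for two-class tasks like this literary-vs-colloquial decision, where both classes matter but per-class error balance does.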
Phonological level wav2vec2-based Mispronunciation Detection and Diagnosis method
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication · Pub Date: 2025-05-23 · DOI: 10.1016/j.specom.2025.103249
Mostafa Shahin, Julien Epps, Beena Ahmed
Abstract: The automatic identification and analysis of pronunciation errors, known as Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in Computer-Aided Pronunciation Learning (CAPL) tools such as second-language (L2) learning or speech therapy applications. Existing MDD methods that rely on analysing phonemes can only detect categorical errors for phonemes with an adequate amount of training data. Owing to the unpredictable nature of pronunciation errors made by non-native or disordered speakers and the scarcity of training datasets, it is unfeasible to model all types of mispronunciation. Moreover, phoneme-level MDD approaches can provide only limited diagnostic information about the error made. To address this, we propose a low-level MDD approach based on the detection of phonological features. Phonological features break down phoneme production into elementary components that are directly related to the articulatory system, leading to more formative feedback for the learner. We further propose a multi-label variant of the Connectionist Temporal Classification (CTC) approach to jointly model the non-mutually-exclusive phonological features with a single model, employing the pre-trained wav2vec2 model as the core of the phonological feature detector. The proposed method was applied to L2 speech corpora collected from English learners with different native languages and compared with traditional phoneme-level MDD, achieving a significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR), and Diagnostic Error Rate (DER) over all phonological features than the phoneme-level equivalent. (Speech Communication, vol. 173, Article 103249.)
Citations: 0
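The FAR and FRR reported above are the standard error pair for detection tasks: FAR is the fraction of true mispronunciations the system accepts as correct, and FRR is the fraction of correct pronunciations it wrongly flags. A minimal sketch over binary per-segment decisions (label conventions and names are ours, not the paper's):

```python
def mdd_rates(labels, preds):
    """labels/preds: 1 = mispronounced, 0 = correctly pronounced.
    Returns (FAR, FRR):
      FAR = mispronunciations accepted as correct / all mispronunciations
      FRR = correct pronunciations rejected / all correct pronunciations."""
    false_accepts = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    false_rejects = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    n_mis = sum(labels)
    n_cor = len(labels) - n_mis
    return false_accepts / n_mis, false_rejects / n_cor

far, frr = mdd_rates([1, 1, 0, 0], [0, 1, 1, 0])
print(far, frr)  # 0.5 0.5
```

In the phonological-level setting each speech frame carries several such binary decisions (one per feature, e.g. voicing or nasality), which is why the paper needs a multi-label CTC variant rather than a single phoneme label sequence.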
Quantifying division of labour: Effects of clause type on intonational meaning
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication · Pub Date: 2025-05-18 · DOI: 10.1016/j.specom.2025.103265
Johannes M. Heim
Abstract: This paper reports quantifiable evidence for a clean division of labour between syntax and prosody in deriving the meaning of rising intonation. The evidence stems from two perception studies that asked participants to rate the speaker attitudes and response expectations expressed by rising declaratives and interrogatives. Rises were manipulated by changing pitch excursion and duration, both known to affect their interpretation. One part of each study addressed the relation between contour shape and perceived speaker confidence or certainty; another addressed the relation between contour shape and perceived response expectation. The two studies differed in whether the rise was paired with a declarative or an interrogative clause. Across clause types, higher excursion led to lower ratings of speaker confidence/certainty and higher ratings of response expectation. For declaratives only, large duration differences also affected ratings of speaker confidence. While the emerging patterns of prosodic form-function mapping were similar across clause types, the effect sizes differed notably. This suggests that pitch excursion, and possibly duration, have clause-type-independent effects that are moderated by default expectations about contour and clause-type combinations. Such an interpretation supports previous compositional accounts of intonational meaning that ascribe independent functions to clause type and intonation, each contributing to their conversational effects. (Speech Communication, vol. 172, Article 103265.)
Citations: 0
Blood pressure monitoring from naturally recorded speech sounds: advancements and future prospects
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication · Pub Date: 2025-05-13 · DOI: 10.1016/j.specom.2025.103255
Fikret Arı, Haydar Ankışhan, Blaise B. Frederick, Lia M. Hocke, Sinem B. Erdoğan
Abstract: The development of an accurate, cuffless system for continuous blood pressure monitoring is essential to reduce deaths due to hypertension. In this study, we present an artificial-intelligence-based system for blood pressure prediction from sentences spoken in natural everyday situations, using only a smartphone and no additional measurements. Our method uses hyperparameter-tuned machine learning (ML) techniques, including the Synthetic Minority Over-sampling Technique (SMOTE), to classify blood pressure as normal or high. By automatically detecting vowels in recorded speech, we extract a statistical feature vector with demographic information (1 × 59-D). Experimental results show classification accuracies of 98.45% for systolic BP and 99.61% for diastolic BP with the Adaptive Synthetic sampling approach for imbalanced learning (ADASYN). These findings underscore the physiological information embedded in human speech and demonstrate the potential of hyperparameter-tuned ML methods for health monitoring, particularly in telehealth, Internet-of-Things devices, and remote monitoring. (Speech Communication, vol. 172, Article 103255.)
Citations: 0
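SMOTE, mentioned above, rebalances a dataset by synthesizing new minority-class points on the line segments between a minority point and one of its k nearest minority neighbours. A minimal pure-Python sketch of that interpolation step only (the paper's vectors are 59-D; the 2-D points, function name, and parameters here are illustrative):

```python
import random

def smote_samples(minority, n_new, k=2, seed=0):
    """SMOTE-style oversampling: each synthetic point lies between a
    randomly chosen minority point and one of its k nearest neighbours."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by squared Euclidean distance
        neigh = sorted((p for p in minority if p is not x),
                       key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(neigh)
        gap = rng.random()  # interpolation fraction in [0, 1)
        out.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return out

new_points = smote_samples([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], n_new=5)
```

Library implementations such as `imblearn.over_sampling.SMOTE` (and `ADASYN`, which the paper reports its best results with) follow the same idea but weight sampling toward harder-to-learn regions.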
Human and automatic voice comparison with regionally variable speech samples
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication · Pub Date: 2025-05-12 · DOI: 10.1016/j.specom.2025.103253
Vincent Hughes, Carmen Llamas, Thomas Kettig
Abstract: In this paper, we compare and combine human and automatic voice comparison results based on short, regionally variable speech samples. Likelihood-ratio-like scores were collected for 120 pairs of same-speaker (45) and different-speaker (75) samples from a total of 896 British English listeners. The samples contained the voices of speakers from Newcastle and Middlesbrough (in North-East England), as well as speakers of Standard Southern British English (modern RP). In addition to within-accent comparisons, the experiment included between-accent, different-speaker comparisons for Middlesbrough and Newcastle, which are perceptually and regionally proximate accents. Scores were also computed using an x-vector PLDA automatic speaker recognition (ASR) system. The ASR system (EER = 10.88%, C_llr = 0.48) outperformed the human listeners (EER = 23.55%, C_llr = 0.75) overall, and no improvement was found when the ASR output was fused with the listener scores. There was, unsurprisingly, considerable between-listener variability, with individual error rates ranging from 0% to 100%. Performance also varied with the regional accent of the speakers: notably, the ASR system performed worst with the Newcastle samples, while humans performed best with them. Human listeners were also more sensitive to high-salience between-accent comparisons, leading to almost categorical different-speaker conclusions, whereas the ASR system's performance on these samples was similar to its within-accent performance. (Speech Communication, vol. 172, Article 103253.)
Citations: 0
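EER and C_llr, quoted above, are the standard discrimination and calibration metrics in forensic voice comparison: EER is the operating point where false accepts equal false rejects, and C_llr penalises miscalibrated likelihood ratios even when the decision would be right. A rough sketch of both (toy scores, not the paper's data; function names are ours):

```python
import math

def eer(same_scores, diff_scores):
    """Equal error rate: sweep a threshold over all observed scores and
    return the midpoint where false-accept and false-reject rates are closest."""
    best = None
    for t in sorted(same_scores + diff_scores):
        frr = sum(s < t for s in same_scores) / len(same_scores)
        far = sum(s >= t for s in diff_scores) / len(diff_scores)
        if best is None or abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2

def cllr(same_lrs, diff_lrs):
    """Log-likelihood-ratio cost (0 = perfect; around 1 = uninformative).
    Inputs are likelihood ratios for same- and different-speaker pairs."""
    c_same = sum(math.log2(1 + 1 / lr) for lr in same_lrs) / len(same_lrs)
    c_diff = sum(math.log2(1 + lr) for lr in diff_lrs) / len(diff_lrs)
    return 0.5 * (c_same + c_diff)

# Well-separated toy scores give EER = 0; uninformative LRs of 1 give Cllr = 1
print(eer([0.9, 0.8], [0.1, 0.2]), cllr([1.0, 1.0], [1.0, 1.0]))
```

The coarse threshold sweep above is adequate for illustration; evaluation toolkits interpolate the ROC/DET curve to locate the EER exactly.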
Dual-path and interactive UNET for speech enhancement with multi-order fractional features
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication · Pub Date: 2025-05-09 · DOI: 10.1016/j.specom.2025.103248
Liyun Xu, Tong Zhang
Abstract: Preprocessing techniques for denoising and enhancement play a crucial role in improving speech recognition performance. In neural-network-based speech enhancement methods, input features provide the network with the essential information it learns from. In this study, we introduce multi-order fractional features into a speech enhancement network; these features can represent fine details and offer the advantage of multi-domain joint analysis, expanding the input information available to the network. We then design a new dual-path UNET in which clean speech and noise are estimated separately. Leveraging the complementarity of the two branches' target estimates, we introduce a fractional information interaction module between the two paths for parameter optimization. Finally, an association module combines the two output streams to improve enhancement performance. Ablation experiments demonstrate the effectiveness of both the multi-order fractional features and the improved dual-path network, and comparison experiments show that the proposed algorithm significantly improves speech quality and intelligibility. (Speech Communication, vol. 172, Article 103248.)
Citations: 0