Speech Communication: Latest Publications

A corpus of audio-visual recordings of linguistically balanced, Danish sentences for speech-in-noise experiments
IF 2.4 | CAS Tier 3, Computer Science
Speech Communication | Pub Date: 2024-10-09 | DOI: 10.1016/j.specom.2024.103141
Abstract: A typical speech-in-noise experiment in a research and development setting can easily contain 20 or more conditions and often requires at least two test points per condition. A sentence test with enough sentences to make this amount of testing possible without repetition does not yet exist in Danish, so a new corpus has been developed to support the creation of a sentence test large enough to address this need. The corpus consists of audio and audio-visual recordings of 1200 linguistically balanced sentences, each spoken by two female and two male talkers. The sentences were constructed using a novel, template-based method that gave control over both word frequency and sentence structure. They were evaluated linguistically in terms of phonemic distribution, naturalness, and connotation, and thereafter recorded, postprocessed, and rated on their audio, visual, and pronunciation quality. This paper describes in detail the methodology used to create and characterize the corpus.
Citations: 0

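The template-based construction mentioned above is described only at a high level. As a rough illustration only, a slot-filling generator drawing from part-of-speech word lists (which could additionally be binned by frequency) might look like the sketch below; the template, slot names, and Danish word lists are invented for the example and are not the authors' materials.

    import random

    # Hypothetical template: NAME VERB NUM ADJ NOUN, each slot filled from a word list.
    templates = [("NAME", "VERB", "NUM", "ADJ", "NOUN")]
    word_lists = {
        "NAME": ["Anna", "Peter"], "VERB": ["ejede", "købte"],
        "NUM": ["tre", "fem"], "ADJ": ["gamle", "røde"], "NOUN": ["stole", "biler"],
    }

    def generate_sentence(rng=random):
        """Fill one randomly chosen template slot by slot."""
        slots = rng.choice(templates)
        return " ".join(rng.choice(word_lists[slot]) for slot in slots)

    print(generate_sentence())   # e.g. "Anna købte tre gamle stole"
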
Forms, factors and functions of phonetic convergence: Editorial
IF 2.4 | CAS Tier 3, Computer Science
Speech Communication | Pub Date: 2024-09-30 | DOI: 10.1016/j.specom.2024.103142
Abstract: This introductory article for the Special Issue on Forms, Factors and Functions of Phonetic Convergence offers a comprehensive overview of the dominant theoretical paradigms, elicitation methods, and computational approaches pertaining to phonetic convergence, and discusses the role of established factors shaping interspeakers' acoustic adjustments. The nine papers in this collection offer new insights into the fundamental mechanisms, factors, and functions behind accommodation in production and perception, and in the perception of accommodation. By integrating acoustic, articulatory, and perceptual evaluations of convergence, and by combining traditional experimental phonetic analysis with computational modeling, the nine papers (1) emphasize the roles of cognitive adaptability and phonetic variability as triggers for convergence, (2) reveal fundamental similarities between the mechanisms of convergence perception and speaker identification, and (3) shed light on the evolutionary link between adaptation in human and animal vocalizations.
Citations: 0

Feasibility of acoustic features of vowel sounds in estimating the upper airway cross-sectional area during wakefulness: A pilot study
IF 2.4 | CAS Tier 3, Computer Science
Speech Communication | Pub Date: 2024-09-28 | DOI: 10.1016/j.specom.2024.103144
Abstract: Assessment of upper airway dimensions has shown great promise in understanding the pathogenesis of obstructive sleep apnea (OSA). However, the current screening system for OSA does not include an objective assessment of the upper airway. The upper airway can be assessed accurately by MRI or CT scans, but these are costly and not easily accessible. Acoustic pharyngometry and ultrasonography are less expensive technologies, but they require trained personnel, which limits their accessibility, especially when the upper airway must be assessed in a clinic environment or before surgery. In this study, we investigated the utility of vowel articulation in assessing upper airway dimensions during normal breathing. To do so, we measured the upper airway cross-sectional area (UA-XSA) with acoustic pharyngometry, then asked the participants to produce five vowels for 3 s each and recorded them with a microphone. We extracted 710 acoustic features from all vowels, compared these features with UA-XSA, and developed regression models to estimate UA-XSA. Our results showed that Mel-frequency cepstral coefficients (MFCC) were the most dominant features, with 7 of the 9 features in the main feature set coming from MFCC. Multiple regression analysis showed that combining the acoustic features with anthropometric features achieved an R² of 0.80 in estimating UA-XSA. An important advantage of acoustic analysis of vowel sounds is that it is simple and can easily be implemented in wearable devices or mobile applications. Such acoustic-based technologies can be accessible in different clinical settings, such as the intensive care unit, and can be used in remote areas. These results could therefore be used to develop user-friendly applications that use acoustic features and demographic information to estimate UA-XSA.
Citations: 0

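The kind of pipeline described above (MFCC-dominated acoustic features combined with anthropometric covariates in a multiple regression on UA-XSA) can be sketched roughly as follows, assuming librosa and scikit-learn are available. The feature summary, covariate names, and data handling are illustrative placeholders, not the authors' actual feature set or protocol.

    import numpy as np
    import librosa
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def vowel_features(wav_path, sr=16000, n_mfcc=13):
        """Summarize one sustained-vowel recording with its mean MFCC vector."""
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return mfcc.mean(axis=1)                     # one vector per recording

    def build_feature_vector(vowel_paths, anthropometrics):
        """Concatenate acoustic features from the five vowels with anthropometric
        covariates (e.g., height, weight, neck circumference) for one participant."""
        acoustic = np.concatenate([vowel_features(p) for p in vowel_paths])
        return np.concatenate([acoustic, anthropometrics])

    # With X of shape (n_participants, n_features) and y the UA-XSA values:
    # model = LinearRegression()
    # r2 = cross_val_score(model, X, y, scoring="r2", cv=5).mean()
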
Zero-shot voice conversion based on feature disentanglement
IF 2.4 | CAS Tier 3, Computer Science
Speech Communication | Pub Date: 2024-09-27 | DOI: 10.1016/j.specom.2024.103143
Abstract: Voice conversion (VC) aims to convert the voice of a source speaker to that of a target speaker without modifying the linguistic content. Zero-shot voice conversion has attracted significant attention because it can convert to and from speakers who did not appear during training. Despite the significant progress made by previous zero-shot VC methods, there is still room for improvement in separating speaker information from content information. In this paper, we propose a zero-shot VC method based on feature disentanglement. The proposed model uses a speaker encoder to extract speaker embeddings, introduces mixed speaker layer normalization to eliminate residual speaker information from the content encoding, and employs adaptive attention weight normalization for conversion. Furthermore, dynamic convolution is introduced to improve speech content modeling while requiring only a small number of parameters. Experiments demonstrate that the proposed model outperforms several state-of-the-art models, achieving both high similarity to the target speaker and high intelligibility. In addition, the decoding speed of our model is much higher than that of existing state-of-the-art models.
Citations: 0

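The abstract describes "mixed speaker layer normalization" only at a high level. A generic speaker-conditioned layer normalization, which that description broadly resembles, is sketched below in PyTorch; the module name, shapes, and conditioning scheme are assumptions for illustration and are not the paper's implementation.

    import torch
    import torch.nn as nn

    class SpeakerConditionedLayerNorm(nn.Module):
        """Normalize content features, then scale and shift them with parameters
        predicted from a speaker embedding (generic conditional layer norm)."""
        def __init__(self, channels, spk_dim):
            super().__init__()
            self.norm = nn.LayerNorm(channels, elementwise_affine=False)
            self.to_gamma = nn.Linear(spk_dim, channels)
            self.to_beta = nn.Linear(spk_dim, channels)

        def forward(self, x, spk):                    # x: (B, T, C), spk: (B, spk_dim)
            gamma = self.to_gamma(spk).unsqueeze(1)   # (B, 1, C)
            beta = self.to_beta(spk).unsqueeze(1)
            return self.norm(x) * (1.0 + gamma) + beta

    # x = torch.randn(2, 100, 256); spk = torch.randn(2, 192)
    # out = SpeakerConditionedLayerNorm(256, 192)(x, spk)
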
Multi-modal co-learning for silent speech recognition based on ultrasound tongue images
IF 2.4 | CAS Tier 3, Computer Science
Speech Communication | Pub Date: 2024-09-12 | DOI: 10.1016/j.specom.2024.103140
Abstract: Silent speech recognition (SSR) is an essential task in human–computer interaction, aiming to recognize speech from non-acoustic modalities. A key challenge in SSR is the inherent ambiguity of the input, since non-acoustic signals lack part of the speech information. This ambiguity leads to homophenes: words with similar inputs yet different pronunciations. Current approaches address this issue either by using richer additional inputs or by training extra models for cross-modal embedding compensation. In this paper, we propose an effective multi-modal co-learning framework that promotes the discriminative ability of silent speech representations via multi-stage training. We first construct the SSR backbone using ultrasound tongue imaging (UTI) as the main modality and then introduce two auxiliary modalities: lip video and audio signals. Using modality dropout, the model learns shared and modality-specific features from all available streams, creating a shared semantic space that improves the generalization of the UTI representation. Given the unbalanced optimization across modalities, we highlight the importance of hyperparameter settings and modulation strategies in enabling modality-specific co-learning for SSR. Experimental results show that the modality-agnostic models with single UTI input outperform state-of-the-art modality-specific models. Confusion analysis based on phonemes and articulatory features confirms that the co-learned UTI representations contain valuable information for distinguishing homophenes. Additionally, our model performs well on two unseen test sets, achieving cross-modal generalization for the uni-modal SSR task.
Citations: 0

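Modality dropout, the mechanism named in the abstract above, can be illustrated with a short sketch: during training the auxiliary streams are randomly suppressed so the model still works from ultrasound alone at test time. The shapes, drop probability, and zero-masking strategy below are illustrative assumptions; real implementations may mask, renormalize, or schedule the dropout differently.

    import torch

    def modality_dropout(uti, lip, audio, p_drop=0.5, training=True):
        """Randomly drop auxiliary modalities during training.
        The main modality (ultrasound tongue imaging) is always kept."""
        if training:
            if torch.rand(1).item() < p_drop:
                lip = torch.zeros_like(lip)
            if torch.rand(1).item() < p_drop:
                audio = torch.zeros_like(audio)
        return uti, lip, audio
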
CLESSR-VC: Contrastive learning enhanced self-supervised representations for one-shot voice conversion
IF 2.4 | CAS Tier 3, Computer Science
Speech Communication | Pub Date: 2024-09-10 | DOI: 10.1016/j.specom.2024.103139
Abstract: One-shot voice conversion (VC) has attracted increasing attention due to its broad prospects for practical application. In this task, the representational ability of the speech features and the model's generalization are the focus of attention. This paper proposes CLESSR-VC, a model that enhances pre-trained self-supervised learning (SSL) representations through contrastive learning for one-shot VC. First, SSL features from the 23rd and 9th layers of the pre-trained WavLM are used to extract content embeddings and SSL speaker embeddings, respectively, to ensure the model's generalization. Then, conventional mel-spectrogram acoustic features and contrastive learning are introduced to enhance the representational ability of the speech features. Specifically, contrastive learning combined with pitch-shift augmentation is applied to accurately disentangle content information from the SSL features, and mel-spectrograms are used to extract a mel speaker embedding. AM-Softmax and cross-architecture contrastive learning are applied between the SSL and mel speaker embeddings to obtain a fused speaker embedding that improves speech quality and speaker similarity. Both objective and subjective evaluations on the VCTK corpus confirm that the proposed VC model achieves outstanding performance with few trainable parameters.
Citations: 0

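One common way to realize "contrastive learning combined with pitch-shift augmentation" for content disentanglement is an InfoNCE-style loss in which an utterance and its pitch-shifted copy are positives and other utterances in the batch are negatives. The sketch below shows that generic loss, not the exact CLESSR-VC objective; pooling to one embedding per utterance and the temperature value are assumptions.

    import torch
    import torch.nn.functional as F

    def content_contrastive_loss(z, z_shifted, temperature=0.1):
        """InfoNCE loss over pooled content embeddings.
        z, z_shifted: (B, D) embeddings of the original and pitch-shifted utterances."""
        z = F.normalize(z, dim=-1)
        z_shifted = F.normalize(z_shifted, dim=-1)
        logits = z @ z_shifted.t() / temperature        # (B, B) similarity matrix
        targets = torch.arange(z.size(0), device=z.device)
        return F.cross_entropy(logits, targets)         # match each row to its own shifted copy
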
CSLNSpeech: Solving the extended speech separation problem with the help of Chinese sign language
IF 2.4 | CAS Tier 3, Computer Science
Speech Communication | Pub Date: 2024-09-02 | DOI: 10.1016/j.specom.2024.103131
Abstract: Previous audio-visual speech separation methods synchronize the speaker's facial movements with the speech in a video to self-supervise speech separation. In this paper, we propose a model that solves the speech separation problem with the assistance of both face and sign language, which we call the extended speech separation problem. We design a general deep learning network that learns to combine three modalities, audio, face, and sign language information, to better solve the speech separation problem. We introduce a large-scale dataset, the Chinese Sign Language News Speech (CSLNSpeech) dataset, in which the three modalities coexist, to train the model. Experimental results show that the proposed model performs better and is more robust than the usual audio-visual systems. In addition, the sign language modality can also be used alone to supervise speech separation, and introducing sign language helps hearing-impaired people learn and communicate. Finally, our model is a general speech separation framework that achieves very competitive separation performance on two open-source audio-visual datasets. The code is available at https://github.com/iveveive/SLNSpeech
Citations: 0

Comparing neural network architectures for non-intrusive speech quality prediction
IF 2.4 | CAS Tier 3, Computer Science
Speech Communication | Pub Date: 2024-08-30 | DOI: 10.1016/j.specom.2024.103123
Abstract: Non-intrusive speech quality predictors evaluate speech quality without a reference signal, making them useful in many practical applications. Recently, neural networks have shown the best performance for this task. Two such models in the literature are the convolutional neural network based DNSMOS and the bi-directional long short-term memory based Quality-Net, which were originally trained to predict subjective ratings and intrusive PESQ scores, respectively. In this paper, these two architectures are trained on a single dataset and used to predict the intrusive ViSQOL score. The evaluation is carried out on a number of test sets with a variety of mismatch conditions, including unseen speech and noise corpora and common voice-over-IP distortions. The experiments show that the models achieve similar predictive ability on the training distribution and generalize well overall to new noise and speech corpora. Unseen distortions are identified as an area where both models generalize poorly, especially DNSMOS. Our results also suggest that pervasive ambient noise in the training set can cause problems when generalizing to certain types of noise. Finally, we detail how the ViSQOL score can have undesirable dependencies on the reference pressure level and the voice activity level.
Citations: 0

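To make the comparison concrete, a Quality-Net-style recurrent predictor regresses a quality score from frame-level spectral features without any reference signal. The PyTorch sketch below shows that general shape only; the layer sizes, input features, and mean pooling are assumptions, not the published configuration of Quality-Net or DNSMOS.

    import torch
    import torch.nn as nn

    class BLSTMQualityPredictor(nn.Module):
        """Frame-wise BLSTM over log-mel features, averaged into one
        utterance-level quality score (rough Quality-Net-style sketch)."""
        def __init__(self, n_mels=80, hidden=128):
            super().__init__()
            self.blstm = nn.LSTM(n_mels, hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, 1)

        def forward(self, mel):                            # mel: (B, T, n_mels)
            frames, _ = self.blstm(mel)
            frame_scores = self.head(frames).squeeze(-1)   # (B, T) per-frame scores
            return frame_scores.mean(dim=1)                # utterance-level score
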
Accurate synthesis of dysarthric speech for ASR data augmentation
IF 2.4 | CAS Tier 3, Computer Science
Speech Communication | Pub Date: 2024-08-10 | DOI: 10.1016/j.specom.2024.103112
Abstract: Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility due to slow, uncoordinated control of the speech production muscles. Automatic speech recognition (ASR) systems can help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. This paper presents a new dysarthric speech synthesis method for ASR training data augmentation. Differences in the prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels are important components for dysarthric speech modeling, synthesis, and augmentation. For synthesis, a modified neural multi-talker TTS is implemented by adding a dysarthria severity level coefficient and a pause insertion model, so that dysarthric speech can be synthesized at varying severity levels. To evaluate the effectiveness of the synthesized training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a relative word error rate (WER) improvement of 12.2% over the baseline, and that adding the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of these parameters. Overall results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned training speech has a significant impact on dysarthric ASR systems. In addition, we conducted a subjective evaluation of the dysarthricness and similarity of the synthesized speech. The evaluation shows that the perceived dysarthricness of the synthesized speech is similar to that of true dysarthric speech, especially at higher severity levels. Audio samples are available at https://mohammadelc.github.io/SpeechGroupUKY/
Citations: 0

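The 12.2% figure above is a relative WER reduction. As a quick reminder of how such a number is computed, the snippet below uses hypothetical WER values chosen only to reproduce a reduction of about that size; they are not taken from the paper.

    def relative_wer_reduction(wer_baseline, wer_augmented):
        """Relative improvement in percent: (baseline - new) / baseline * 100."""
        return 100.0 * (wer_baseline - wer_augmented) / wer_baseline

    # Hypothetical example: a baseline WER of 40.0% dropping to 35.1%
    # corresponds to a relative reduction of roughly 12.2%.
    print(relative_wer_reduction(40.0, 35.1))   # ≈ 12.25
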
CFAD: A Chinese dataset for fake audio detection
IF 2.4 | CAS Tier 3, Computer Science
Speech Communication | Pub Date: 2024-08-08 | DOI: 10.1016/j.specom.2024.103122
Abstract: Fake audio detection is a growing concern, and several relevant datasets have been designed for research. However, there is no standard public Chinese dataset covering complex conditions. In this paper, we aim to fill this gap and design a Chinese fake audio detection dataset (CFAD) for studying more generalized detection methods. Twelve mainstream speech-generation techniques are used to generate fake audio. To simulate real-life scenarios, three noise datasets are selected for noise addition at five different signal-to-noise ratios, and six codecs are considered for audio transcoding (format conversion). The CFAD dataset can be used not only for fake audio detection but also for identifying the algorithms behind fake utterances in audio forensics. Baseline results are presented and analyzed; they show that building fake audio detection methods that generalize well remains challenging. The CFAD dataset is publicly available.
Citations: 0

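Adding noise at a fixed signal-to-noise ratio, as in the dataset construction above, follows the standard scaling: the noise is rescaled so that 10·log10(P_signal / P_noise) equals the target SNR. A minimal sketch of that step is shown below; the noise-looping behavior and epsilon guard are implementation choices, not details from the paper.

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        """Scale the noise so the mixture has the requested SNR (in dB), then add it."""
        if len(noise) < len(speech):                   # loop the noise if it is too short
            noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
        noise = noise[:len(speech)]
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + scale * noise
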