Speech Communication: Latest Articles

Speech-driven head motion generation from waveforms
IF 3.2, CAS Tier 3, Computer Science
Speech Communication Pub Date: 2024-03-01 DOI: 10.1016/j.specom.2024.103056
JinHong Lu, Hiroshi Shimodaira
{"title":"Speech-driven head motion generation from waveforms","authors":"JinHong Lu,&nbsp;Hiroshi Shimodaira","doi":"10.1016/j.specom.2024.103056","DOIUrl":"10.1016/j.specom.2024.103056","url":null,"abstract":"<div><p>Head motion generation task for speech-driven virtual agent animation is commonly explored with handcrafted audio features, such as MFCCs as input features, plus additional features, such as energy and F0 in the literature. In this paper, we study the direct use of speech waveform to generate head motion. We claim that creating a task-specific feature from waveform to generate head motion leads to better performance than using standard acoustic features to generate head motion overall. At the same time, we completely abandon the handcrafted feature extraction process, leading to more effectiveness. However, the difficulty of creating a task-specific feature from waveform is their staggering quantity of irrelevant information, implicating potential cumbrance for neural network training. Thus, we apply a canonical-correlation-constrained autoencoder (CCCAE), where we are able to compress the high-dimensional waveform into a low-dimensional embedded feature, with the minimal error in reconstruction, and sustain the relevant information with the maximal cannonical correlation to head motion. We extend our previous research by including more speakers in our dataset and also adapt with a recurrent neural network, to show the feasibility of our proposed feature. Through comparisons between different acoustic features, our proposed feature, <span><math><msub><mrow><mtext>Wav</mtext></mrow><mrow><mtext>CCCAE</mtext></mrow></msub></math></span>, shows at least a 20% improvement in the correlation from the waveform, and outperforms the popular acoustic feature, MFCC, by at least 5% respectively for all speakers. Through the comparison in the feedforward neural network regression (FNN-regression) system, the <span><math><msub><mrow><mtext>Wav</mtext></mrow><mrow><mtext>CCCAE</mtext></mrow></msub></math></span>-based system shows comparable performance in objective evaluation. In long short-term memory (LSTM) experiments, LSTM-models improve the overall performance in normalised mean square error (NMSE) and CCA metrics, and adapt the <span><math><msub><mrow><mtext>Wav</mtext></mrow><mrow><mtext>CCCAE</mtext></mrow></msub></math></span>feature better, which makes the proposed LSTM-regression system outperform the MFCC-based system. We also re-design the subjective evaluation, and the subjective results show the animations generated by models where <span><math><msub><mrow><mtext>Wav</mtext></mrow><mrow><mtext>CCCAE</mtext></mrow></msub></math></span>was chosen to be better than the other models by the participants of MUSHRA test.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"159 ","pages":"Article 103056"},"PeriodicalIF":3.2,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000281/pdfft?md5=3e4ce95ea878ead804890332c3362074&pid=1-s2.0-S0167639324000281-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140089565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
PLDE: A lightweight pooling layer for spoken language recognition
IF 3.2, CAS Tier 3, Computer Science
Speech Communication Pub Date: 2024-02-23 DOI: 10.1016/j.specom.2024.103055
Zimu Li, Yanyan Xu, Dengfeng Ke, Kaile Su
{"title":"PLDE: A lightweight pooling layer for spoken language recognition","authors":"Zimu Li ,&nbsp;Yanyan Xu ,&nbsp;Dengfeng Ke ,&nbsp;Kaile Su","doi":"10.1016/j.specom.2024.103055","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103055","url":null,"abstract":"<div><p>In recent years, the transfer learning method of replacing acoustic features with phonetic features has become a new paradigm for end-to-end spoken language recognition. However, these larger transfer learning models always encode too much redundant information. In this paper, we propose a lightweight language recognition decoder based on a phonetic learnable dictionary encoding (PLDE) layer, which is more suitable for phonetic features and achieves better recognition performances while significantly reducing the number of parameters. The lightweight decoder consists of three main parts: (1) a phonetic learnable dictionary with ghost clusters, which improves the traditional LDE pooling layer and enhances the model’s ability to model noise with ghost clusters; (2) coarse-grained chunk-level pooling, which can highlight the phone sequence and suppress noise around ghost clusters, and hence reduce their influence to the subsequent network; (3) fine-grained chunk-level projection, which enables the discriminative network to obtain more linguistic information and hence improve the model’s modelling ability. These three parts simplify the language recognition decoder into a PLDE pooling layer, reducing the parameter size of the decoder by at least one order of magnitude while achieving better recognition performances. In experiments on the OLR2020 dataset, the <span><math><msub><mrow><mi>C</mi></mrow><mrow><mi>a</mi><mi>v</mi><mi>g</mi></mrow></msub></math></span> of the proposed method exceeds that of the current state-of-the-art language recognition system, achieving 24.68% and 42.24% improvements on the cross-channel test set and unknown noise test set, respectively. Furthermore, experimental results on the OLR2021 dataset also demonstrate the effectiveness of PLDE.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"158 ","pages":"Article 103055"},"PeriodicalIF":3.2,"publicationDate":"2024-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139943007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Pre-trained models for detection and severity level classification of dysarthria from speech
IF 3.2, CAS Tier 3, Computer Science
Speech Communication Pub Date: 2024-02-14 DOI: 10.1016/j.specom.2024.103047
Farhad Javanmardi, Sudarsana Reddy Kadiri, Paavo Alku
{"title":"Pre-trained models for detection and severity level classification of dysarthria from speech","authors":"Farhad Javanmardi,&nbsp;Sudarsana Reddy Kadiri,&nbsp;Paavo Alku","doi":"10.1016/j.specom.2024.103047","DOIUrl":"10.1016/j.specom.2024.103047","url":null,"abstract":"<div><p>Automatic detection and severity level classification of dysarthria from speech enables non-invasive and effective diagnosis that helps clinical decisions about medication and therapy of patients. In this work, three pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) are studied to extract features to build automatic detection and severity level classification systems for dysarthric speech. The experiments were conducted using two publicly available databases (UA-Speech and TORGO). One machine learning-based model (support vector machine, SVM) and one deep learning-based model (convolutional neural network, CNN) was used as the classifier. In order to compare the performance of the wav2vec2-BASE, wav2vec2-LARGE, and HuBERT features, three popular acoustic feature sets, namely, mel-frequency cepstral coefficients (MFCCs), openSMILE and extended Geneva minimalistic acoustic parameter set (eGeMAPS) were considered. Experimental results revealed that the features derived from the pre-trained models outperformed the three baseline features. It was also found that the HuBERT features performed better than the wav2vec2-BASE and wav2vec2-LARGE features. In particular, when compared to the best-performing baseline feature (openSMILE), the HuBERT features showed in the detection problem absolute accuracy improvements that varied between 1.33% (the SVM classifier, the TORGO database) and 2.86% (the SVM classifier, the UA-Speech database). In the severity level classification problem, the HuBERT features showed absolute accuracy improvements that varied between 6.54% (the SVM classifier, the TORGO database) and 10.46% (the SVM classifier, the UA-Speech database) compared to the best-performing baseline feature (eGeMAPS).</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"158 ","pages":"Article 103047"},"PeriodicalIF":3.2,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000190/pdfft?md5=06e82e9568d6d0d206292d39eb27d9c4&pid=1-s2.0-S0167639324000190-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139877101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
On intrusive speech quality measures and a global SNR based metric
IF 3.2, CAS Tier 3, Computer Science
Speech Communication Pub Date: 2024-02-14 DOI: 10.1016/j.specom.2024.103044
Chao Pan, Jingdong Chen, Jacob Benesty
{"title":"On intrusive speech quality measures and a global SNR based metric","authors":"Chao Pan ,&nbsp;Jingdong Chen ,&nbsp;Jacob Benesty","doi":"10.1016/j.specom.2024.103044","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103044","url":null,"abstract":"<div><p>Measuring the quality of noisy speech signals has been an increasingly important problem in the field of speech processing as more and more speech-communication and human-machine-interface systems are deployed in practical applications. In this paper, we study four widely used classical performance measures: signal-to-distortion ratio (SDR), short-time objective intelligibility (STOI), signal-to-noise ratio (SNR), and perceptual evaluation of speech quality (PESQ). Through analyzing these performance measures under the same framework and identifying the relationship between their core parameters, we convert these measures into the corresponding equivalent SNRs. This conversion enables not only some new insights into different quality measures but also a way to combine these measures into a new metric. In the derivation of the equivalent SNRs, we introduce the widely used masking technique into the computation of correlation coefficients, which is subsequently used to analyze STOI. Furthermore, we propose an attention method to compute the core parameters of PESQ, and also an empirical formula to project the equivalent SNRs into PESQ scores. Experiments are carried out and the results justifies the properties of the derived quality measures.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"158 ","pages":"Article 103044"},"PeriodicalIF":3.2,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139749477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Deletion and insertion tampering detection for speech authentication based on fluctuating super vector of electrical network frequency
IF 3.2, CAS Tier 3, Computer Science
Speech Communication Pub Date: 2024-02-12 DOI: 10.1016/j.specom.2024.103046
Chunyan Zeng, Shuai Kong, Zhifeng Wang, Shixiong Feng, Nan Zhao, Juan Wang
{"title":"Deletion and insertion tampering detection for speech authentication based on fluctuating super vector of electrical network frequency","authors":"Chunyan Zeng ,&nbsp;Shuai Kong ,&nbsp;Zhifeng Wang ,&nbsp;Shixiong Feng ,&nbsp;Nan Zhao ,&nbsp;Juan Wang","doi":"10.1016/j.specom.2024.103046","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103046","url":null,"abstract":"<div><p>The current digital speech deletion and insertion tampering detection methods mainly employes the extraction of phase and frequency features of the Electrical Network Frequency (ENF). However, there are some problems with the existing approaches, such as the alignment problem for speech samples with different durations, the sparsity of ENF features, and the small number of tampered speech samples for training, which lead to low accuracy of deletion and insertion tampering detection. Therefore, this paper proposes a tampering detection method for digital speech deletion and insertion based on ENF Fluctuation Super-vector (ENF-FSV) and deep feature learning representation. By extracting the parameters of ENF phase and frequency fitting curves, feature alignment and dimensionality reduction are achieved, and the alignment and sparsity problems are avoided while the fluctuation information of phase and frequency is extracted. To solve the problem of the insufficient sample size of tampered speech for training, the ENF Universal Background Model (ENF-UBM) is built by a large number of untampered speech samples, and the mean vector is updated to extract ENF-FSV. Considering the shallow representation of ENF features with not highlighting important features, we construct an end-to-end deep neural network to strengthen the attention to the abrupt fluctuation information by the attention mechanism to enhance the representational power of the ENF-FSV features, and then the deep ENF-FSV features extracted by the Residual Network (ResNet) module are fed to the designed classification network for tampering detection. The experimental results show that the method in this paper exhibits higher accuracy and better robustness in the Carioca, New Spanish, and ENF High-sampling Group (ENF-HG) databases when compared with the state-of-the-art methods.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"158 ","pages":"Article 103046"},"PeriodicalIF":3.2,"publicationDate":"2024-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139725832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Some properties of mental speech preparation as revealed by self-monitoring
IF 3.2, CAS Tier 3, Computer Science
Speech Communication Pub Date: 2024-02-09 DOI: 10.1016/j.specom.2024.103043
Hugo Quené, Sieb G. Nooteboom
{"title":"Some properties of mental speech preparation as revealed by self-monitoring","authors":"Hugo Quené,&nbsp;Sieb G. Nooteboom","doi":"10.1016/j.specom.2024.103043","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103043","url":null,"abstract":"<div><p>The main goal of this paper is to improve our insight in the mental preparation of speech, based on speakers' self-monitoring behavior. To this end we re-analyze the aggregated responses from earlier published experiments eliciting speech sound errors. The re-analyses confirm or show that (1) “early” and “late” detections of elicited speech sound errors can be distinguished, with a time delay in the order of 500 ms; (2) a main cause for some errors to be detected “early”, others “late” and others again not at all is the size of the phonetic contrast between the error and the target speech sound; (3) repairs of speech sound errors stem from competing (and sometimes active) word candidates. These findings lead to some speculative conclusions regarding the mental preparation of speech. First, there are two successive stages of mental preparation, an “early” and a “late” stage. Second, at the “early” stage of speech preparation, speech sounds are represented as targets in auditory perceptual space, at the “late” stage as coordinated motor commands necessary for articulation. Third, repairs of speech sound errors stem from response candidates competing for the same slot with the error form, and some activation often is sustained until after articulation.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"158 ","pages":"Article 103043"},"PeriodicalIF":3.2,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000153/pdfft?md5=0778601c47d5f7635cc40d5c60526a59&pid=1-s2.0-S0167639324000153-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139738033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Validation of an ECAPA-TDNN system for Forensic Automatic Speaker Recognition under case work conditions
IF 3.2, CAS Tier 3, Computer Science
Speech Communication Pub Date: 2024-02-09 DOI: 10.1016/j.specom.2024.103045
Francesco Sigona, Mirko Grimaldi
{"title":"Validation of an ECAPA-TDNN system for Forensic Automatic Speaker Recognition under case work conditions","authors":"Francesco Sigona,&nbsp;Mirko Grimaldi","doi":"10.1016/j.specom.2024.103045","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103045","url":null,"abstract":"<div><p>In this work, we tested different variants of a Forensic Automatic Speaker Recognition (FASR) system based on Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN). To this scope, conditions reflecting those of a real forensic voice comparison case have been taken into consideration according to the <em>forensic_eval_01</em> evaluation campaign settings. Using this recent neural model as an embedding extraction block, various normalization strategies at the level of embeddings and scores allowed us to observe the variations in system performance in terms of discriminating power, accuracy and precision metrics. Our findings suggest that the ECAPA-TDNN can be successfully used as a base component of a FASR system, managing to surpass the previous state of the art, at least in the context of the considered operating conditions.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"158 ","pages":"Article 103045"},"PeriodicalIF":3.2,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000177/pdfft?md5=4a1c2390e5be4931eca4de00e7d357e7&pid=1-s2.0-S0167639324000177-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139914548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Automatic classification of neurological voice disorders using wavelet scattering features
IF 3.2, CAS Tier 3, Computer Science
Speech Communication Pub Date: 2024-02-01 DOI: 10.1016/j.specom.2024.103040
Madhu Keerthana Yagnavajjula, Kiran Reddy Mittapalle, Paavo Alku, Sreenivasa Rao K., Pabitra Mitra
{"title":"Automatic classification of neurological voice disorders using wavelet scattering features","authors":"Madhu Keerthana Yagnavajjula ,&nbsp;Kiran Reddy Mittapalle ,&nbsp;Paavo Alku ,&nbsp;Sreenivasa Rao K. ,&nbsp;Pabitra Mitra","doi":"10.1016/j.specom.2024.103040","DOIUrl":"10.1016/j.specom.2024.103040","url":null,"abstract":"<div><p>Neurological voice disorders are caused by problems in the nervous system as it interacts with the larynx. In this paper, we propose to use wavelet scattering transform (WST)-based features in automatic classification of neurological voice disorders. As a part of WST, a speech signal is processed in stages with each stage consisting of three operations – convolution, modulus and averaging – to generate low-variance data representations that preserve discriminability across classes while minimizing differences within a class. The proposed WST-based features were extracted from speech signals of patients suffering from either spasmodic dysphonia (SD) or recurrent laryngeal nerve palsy (RLNP) and from speech signals of healthy speakers of the Saarbruecken voice disorder (SVD) database. Two machine learning algorithms (support vector machine (SVM) and feed forward neural network (NN)) were trained separately using the WST-based features, to perform two binary classification tasks (healthy vs. SD and healthy vs. RLNP) and one multi-class classification task (healthy vs. SD vs. RLNP). The results show that WST-based features outperformed state-of-the-art features in all three tasks. Furthermore, the best overall classification performance was achieved by the NN classifier trained using WST-based features.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"157 ","pages":"Article 103040"},"PeriodicalIF":3.2,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000128/pdfft?md5=98a659d5cd3309ac33e76a42084db6ed&pid=1-s2.0-S0167639324000128-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139589964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
AVID: A speech database for machine learning studies on vocal intensity
IF 3.2, CAS Tier 3, Computer Science
Speech Communication Pub Date: 2024-02-01 DOI: 10.1016/j.specom.2024.103039
Paavo Alku, Manila Kodali, Laura Laaksonen, Sudarsana Reddy Kadiri
{"title":"AVID: A speech database for machine learning studies on vocal intensity","authors":"Paavo Alku ,&nbsp;Manila Kodali ,&nbsp;Laura Laaksonen ,&nbsp;Sudarsana Reddy Kadiri","doi":"10.1016/j.specom.2024.103039","DOIUrl":"10.1016/j.specom.2024.103039","url":null,"abstract":"<div><p>Vocal intensity, which is quantified typically with the sound pressure level (SPL), is a key feature of speech. To measure SPL from speech recordings, a standard calibration tone (with a reference SPL of 94 dB or 114 dB) needs to be recorded together with speech. However, most of the popular databases that are used in areas such as speech and speaker recognition have been recorded without calibration information by expressing speech on arbitrary amplitude scales. Therefore, information about vocal intensity of the recorded speech, including SPL, is lost. In the current study, we introduce a new open and calibrated speech/electroglottography (EGG) database named Aalto Vocal Intensity Database (AVID). AVID includes speech and EGG produced by 50 speakers (25 males, 25 females) who varied their vocal intensity in four categories (soft, normal, loud and very loud). Recordings were conducted using a constant mouth-to-microphone distance and by recording a calibration tone. The speech data was labelled sentence-wise using a total of 19 labels that support the utilisation of the data in machine learning (ML) -based studies of vocal intensity based on supervised learning. In order to demonstrate how the AVID data can be used to study vocal intensity, we investigated one multi-class classification task (classification of speech into soft, normal, loud and very loud intensity classes) and one regression task (prediction of SPL of speech). In both tasks, we deliberately warped the level of the input speech by normalising the signal to have its maximum amplitude equal to 1.0, that is, we simulated a scenario that is prevalent in current speech databases. The results show that using the spectrogram feature with the support vector machine classifier gave an accuracy of 82% in the multi-class classification of the vocal intensity category. In the prediction of SPL, using the spectrogram feature with the support vector regressor gave an mean absolute error of about 2 dB and a coefficient of determination of 92%. We welcome researchers interested in classification and regression problems to utilise AVID in the study of vocal intensity, and we hope that the current results could serve as baselines for future ML studies on the topic.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"157 ","pages":"Article 103039"},"PeriodicalIF":3.2,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000116/pdfft?md5=c116ec551b37da3e4f4867e6d11803ea&pid=1-s2.0-S0167639324000116-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139560424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Monophthong vocal tract shapes are sufficient for articulatory synthesis of German primary diphthongs
IF 3.2, CAS Tier 3, Computer Science
Speech Communication Pub Date: 2024-02-01 DOI: 10.1016/j.specom.2024.103041
Simon Stone, Peter Birkholz
{"title":"Monophthong vocal tract shapes are sufficient for articulatory synthesis of German primary diphthongs","authors":"Simon Stone,&nbsp;Peter Birkholz","doi":"10.1016/j.specom.2024.103041","DOIUrl":"10.1016/j.specom.2024.103041","url":null,"abstract":"<div><p><span>German primary diphthongs are conventionally transcribed using the same symbols used for some monophthong vowels. However, if the corresponding vocal tract shapes are used for articulatory synthesis, the results often sound unnatural. Furthermore, there is no clear consensus in the literature if diphthongs have monopthong constituents and if so, which ones. This study therefore analyzed a set of audio recordings from the reference speaker of the state-of-the-art articulatory synthesizer VocalTractLab to identify likely candidates for the monophthong constituents of the German primary diphthongs. We then evaluated these candidates in a listening experiment with naive listeners to determine a </span>naturalness ranking of these candidates and specialized diphthong shapes. The results showed that the German primary diphthongs can indeed be synthesized with no significant loss in naturalness by replacing the specialized diphthong shapes for the initial and final segments by shapes also used for monopthong vowels.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"157 ","pages":"Article 103041"},"PeriodicalIF":3.2,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139589966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0