Acoustic-to-Articulatory Inversion of a Three-dimensional Theoretical Vocal Tract Model Using Deep Learning-based Model

Thanat Lapthawan, S. Prom-on
{"title":"Acoustic-to-Articulatory Inversion of a Three-dimensional Theoretical Vocal Tract Model Using Deep Learning-based Model","authors":"Thanat Lapthawan, S. Prom-on","doi":"10.1109/ICAwST.2019.8923588","DOIUrl":null,"url":null,"abstract":"This paper presents an acoustic-to-articulatory mapping of a three-dimensional theoretical vocal tract model using deep learning methods. Prominent deep learning-based network structures are explored and evaluated for their suitability in capturing the relationship between acoustic and articulatory-oriented vocal tract parameters. The dataset was synthesized from VocalTractLab, a three-dimensional theoretical articulatory synthesizer, in forms of the pairs of acoustic, represented by Mel-frequency cepstral coefficients (MFCCs), and articulatory signals, represented by 23 vocal tract parameters. The sentence structure used in the dataset generation were both monosyllabic and disyllabic vowel articulations. Models were evaluated using the root-mean-square error (RMSE) and R-squared (R2). The deep artificial neural network architecture (DNN), regulating using batch normalization, achieves the best performance for both inversion tasks, RMSE of 0.015 and R2 of 0.970 for monosyllabic vowels and RMSE of 0.015and R2 of 0.975 for disyllabic vowels. The comparison, between a formant of a sound from inverted articulatory parameters and the original synthesized sound, demonstrates that there is no statistically different between original and estimated parameters. The results indicate that the deep learning-based model is effectively estimated articulatory parameters in a three-dimensional space of a vocal tract model.","PeriodicalId":156538,"journal":{"name":"2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICAwST.2019.8923588","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

This paper presents an acoustic-to-articulatory mapping of a three-dimensional theoretical vocal tract model using deep learning methods. Prominent deep learning-based network architectures are explored and evaluated for their suitability in capturing the relationship between acoustic and articulatory-oriented vocal tract parameters. The dataset was synthesized with VocalTractLab, a three-dimensional theoretical articulatory synthesizer, as pairs of acoustic signals, represented by Mel-frequency cepstral coefficients (MFCCs), and articulatory signals, represented by 23 vocal tract parameters. The utterances used in dataset generation were monosyllabic and disyllabic vowel articulations. Models were evaluated using the root-mean-square error (RMSE) and the coefficient of determination (R2). A deep artificial neural network (DNN) regularized with batch normalization achieves the best performance on both inversion tasks: RMSE of 0.015 and R2 of 0.970 for monosyllabic vowels, and RMSE of 0.015 and R2 of 0.975 for disyllabic vowels. A comparison between the formants of sounds resynthesized from the inverted articulatory parameters and the original synthesized sounds shows no statistically significant difference between the original and estimated parameters. These results indicate that the deep learning-based model effectively estimates articulatory parameters in the three-dimensional space of a vocal tract model.
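For reference, the two evaluation metrics named in the abstract have the standard definitions below. These are the conventional formulas, not reproduced from the paper; here $y_i$ is a true vocal tract parameter value, $\hat{y}_i$ its estimate, and $\bar{y}$ the mean of the true values over $N$ samples:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}
$$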
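As an illustration of the model family the abstract describes, the following is a minimal sketch of a feed-forward DNN with batch normalization that maps MFCC frames to the 23 VocalTractLab vocal tract parameters. The layer widths, network depth, MFCC dimensionality (13), and framework choice (PyTorch) are assumptions made for illustration, not the authors' reported configuration.

```python
import torch
import torch.nn as nn

class InversionDNN(nn.Module):
    """Sketch of an acoustic-to-articulatory inversion network:
    MFCC frames in, vocal tract parameters out. Hyperparameters
    here are assumed, not taken from the paper."""

    def __init__(self, n_mfcc=13, n_tract_params=23, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mfcc, hidden),
            nn.BatchNorm1d(hidden),  # batch normalization regularizes each hidden layer
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_tract_params),
        )

    def forward(self, mfcc_frames):
        # mfcc_frames: (batch, n_mfcc) -> (batch, n_tract_params)
        return self.net(mfcc_frames)

model = InversionDNN()
dummy = torch.randn(8, 13)   # a batch of 8 frames, 13 MFCCs each
params = model(dummy)        # predicted vocal tract parameters
print(params.shape)          # torch.Size([8, 23])
```

Such a model would typically be trained with a mean-squared-error loss against the synthesizer's ground-truth parameters, which is consistent with RMSE being the reported evaluation metric.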