{"title":"基于深度学习的三维理论声道模型的声学-发音反转","authors":"Thanat Lapthawan, S. Prom-on","doi":"10.1109/ICAwST.2019.8923588","DOIUrl":null,"url":null,"abstract":"This paper presents an acoustic-to-articulatory mapping of a three-dimensional theoretical vocal tract model using deep learning methods. Prominent deep learning-based network structures are explored and evaluated for their suitability in capturing the relationship between acoustic and articulatory-oriented vocal tract parameters. The dataset was synthesized from VocalTractLab, a three-dimensional theoretical articulatory synthesizer, in forms of the pairs of acoustic, represented by Mel-frequency cepstral coefficients (MFCCs), and articulatory signals, represented by 23 vocal tract parameters. The sentence structure used in the dataset generation were both monosyllabic and disyllabic vowel articulations. Models were evaluated using the root-mean-square error (RMSE) and R-squared (R2). The deep artificial neural network architecture (DNN), regulating using batch normalization, achieves the best performance for both inversion tasks, RMSE of 0.015 and R2 of 0.970 for monosyllabic vowels and RMSE of 0.015and R2 of 0.975 for disyllabic vowels. The comparison, between a formant of a sound from inverted articulatory parameters and the original synthesized sound, demonstrates that there is no statistically different between original and estimated parameters. The results indicate that the deep learning-based model is effectively estimated articulatory parameters in a three-dimensional space of a vocal tract model.","PeriodicalId":156538,"journal":{"name":"2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Acoustic-to-Articulatory Inversion of a Three-dimensional Theoretical Vocal Tract Model Using Deep Learning-based Model\",\"authors\":\"Thanat Lapthawan, S. Prom-on\",\"doi\":\"10.1109/ICAwST.2019.8923588\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents an acoustic-to-articulatory mapping of a three-dimensional theoretical vocal tract model using deep learning methods. Prominent deep learning-based network structures are explored and evaluated for their suitability in capturing the relationship between acoustic and articulatory-oriented vocal tract parameters. The dataset was synthesized from VocalTractLab, a three-dimensional theoretical articulatory synthesizer, in forms of the pairs of acoustic, represented by Mel-frequency cepstral coefficients (MFCCs), and articulatory signals, represented by 23 vocal tract parameters. The sentence structure used in the dataset generation were both monosyllabic and disyllabic vowel articulations. Models were evaluated using the root-mean-square error (RMSE) and R-squared (R2). The deep artificial neural network architecture (DNN), regulating using batch normalization, achieves the best performance for both inversion tasks, RMSE of 0.015 and R2 of 0.970 for monosyllabic vowels and RMSE of 0.015and R2 of 0.975 for disyllabic vowels. The comparison, between a formant of a sound from inverted articulatory parameters and the original synthesized sound, demonstrates that there is no statistically different between original and estimated parameters. 
The results indicate that the deep learning-based model is effectively estimated articulatory parameters in a three-dimensional space of a vocal tract model.\",\"PeriodicalId\":156538,\"journal\":{\"name\":\"2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST)\",\"volume\":\"14 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICAwST.2019.8923588\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICAwST.2019.8923588","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Acoustic-to-Articulatory Inversion of a Three-dimensional Theoretical Vocal Tract Model Using Deep Learning-based Model
This paper presents an acoustic-to-articulatory mapping of a three-dimensional theoretical vocal tract model using deep learning methods. Prominent deep learning-based network structures are explored and evaluated for their suitability in capturing the relationship between acoustic and articulatory-oriented vocal tract parameters. The dataset was synthesized with VocalTractLab, a three-dimensional theoretical articulatory synthesizer, as pairs of acoustic signals, represented by Mel-frequency cepstral coefficients (MFCCs), and articulatory signals, represented by 23 vocal tract parameters. The utterances used in dataset generation were monosyllabic and disyllabic vowel articulations. Models were evaluated using the root-mean-square error (RMSE) and the coefficient of determination (R2). A deep artificial neural network (DNN) regularized with batch normalization achieves the best performance on both inversion tasks: RMSE of 0.015 and R2 of 0.970 for monosyllabic vowels, and RMSE of 0.015 and R2 of 0.975 for disyllabic vowels. A comparison between the formants of sounds resynthesized from the inverted articulatory parameters and the original synthesized sounds shows no statistically significant difference between the original and estimated parameters. The results indicate that the deep learning-based model effectively estimates articulatory parameters in the three-dimensional space of a vocal tract model.
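
As a rough illustration of the inversion setup described in the abstract (not the authors' exact architecture), the sketch below shows a batch-normalized feed-forward network mapping MFCC frames to 23 vocal tract parameters, evaluated with RMSE and R2. The layer widths, depth, MFCC dimensionality, and training settings are assumptions for illustration only; the real training data would be MFCC/vocal-tract-parameter pairs synthesized with VocalTractLab.

```python
# Hypothetical sketch of a batch-normalized DNN for acoustic-to-articulatory
# inversion: MFCC frames in, 23 vocal tract parameters out. Layer sizes, depth,
# and the MFCC dimension are illustrative assumptions, not the paper's
# reported configuration.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

N_MFCC = 13          # assumed MFCC dimension per frame
N_VT_PARAMS = 23     # articulatory targets, as stated in the abstract

def build_inversion_dnn(n_in=N_MFCC, n_out=N_VT_PARAMS):
    model = models.Sequential([
        layers.Input(shape=(n_in,)),
        layers.Dense(256), layers.BatchNormalization(), layers.Activation("relu"),
        layers.Dense(256), layers.BatchNormalization(), layers.Activation("relu"),
        layers.Dense(n_out, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse",
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model

if __name__ == "__main__":
    # Random stand-in data; replace with MFCC / vocal tract parameter pairs
    # generated by an articulatory synthesizer such as VocalTractLab.
    X = np.random.rand(1000, N_MFCC).astype("float32")
    Y = np.random.rand(1000, N_VT_PARAMS).astype("float32")

    dnn = build_inversion_dnn()
    dnn.fit(X, Y, epochs=2, batch_size=32, verbose=0)
    Y_hat = dnn.predict(X, verbose=0)

    # RMSE and R2, the two metrics used in the paper.
    rmse = float(np.sqrt(np.mean((Y - Y_hat) ** 2)))
    ss_res = np.sum((Y - Y_hat) ** 2)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2)
    r2 = float(1.0 - ss_res / ss_tot)
    print(f"RMSE={rmse:.3f}, R2={r2:.3f}")
```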