Autoencoder-Based Tongue Shape Estimation During Continuous Speech
Vinicius Ribeiro, Y. Laprie
Interspeech, 2022-09-18, pp. 86-90. DOI: https://doi.org/10.21437/interspeech.2022-10272
Vocal tract shape estimation is a necessary step for articulatory speech synthesis. However, the literature on the topic is scarce, and most current methods fail to satisfy many of the physical constraints of speech production. This study proposes an alternative approach that addresses specific issues in previous work, especially those related to critical articulators. We present an autoencoder-based method for tongue shape estimation during continuous speech. An autoencoder is trained to learn an encoding of the data and serves as an auxiliary network for the principal one, which maps phonemes to tongue shapes. Instead of predicting the exact points of the target curve, the principal network learns to predict the curve’s main components, i.e., the autoencoder’s latent representation. We show how this approach allows imposing constraints on critical articulators, controlling the tongue shape through the latent space, and generating a smooth output without relying on any postprocessing method.
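
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the contours are synthetic, a linear autoencoder (PCA computed by SVD) stands in for the neural autoencoder, and a least-squares map from one-hot phonemes stands in for the principal phoneme-to-shape network. All names (`encode`, `decode`, `predict_curve`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 200 contours of 50 points each,
# built from 3 smooth modes whose weights depend on a phoneme label.
n_samples, n_points, n_latent, n_phonemes = 200, 50, 3, 5
s = np.linspace(0.0, 1.0, n_points)
modes = np.stack([np.sin(np.pi * (k + 1) * s) for k in range(n_latent)])
phoneme_ids = rng.integers(0, n_phonemes, size=n_samples)
phoneme_codes = rng.normal(size=(n_phonemes, n_latent))
weights = phoneme_codes[phoneme_ids] + 0.05 * rng.normal(size=(n_samples, n_latent))
curves = weights @ modes  # shape (n_samples, n_points)

# Stage 1: auxiliary "autoencoder". Here a linear one (PCA via SVD),
# swapped in for the neural autoencoder described in the abstract.
mean_curve = curves.mean(axis=0)
_, _, vt = np.linalg.svd(curves - mean_curve, full_matrices=False)
basis = vt[:n_latent]  # orthonormal rows spanning the latent space


def encode(x):
    """Project curves onto the latent space."""
    return (x - mean_curve) @ basis.T


def decode(z):
    """Map latent codes back to full curves."""
    return z @ basis + mean_curve


# Stage 2: principal model. It is fitted to predict latent codes,
# not curve points; a least-squares map replaces the neural network.
one_hot = np.eye(n_phonemes)[phoneme_ids]
latent_targets = encode(curves)
mapping, *_ = np.linalg.lstsq(one_hot, latent_targets, rcond=None)


def predict_curve(phoneme_id):
    """Predict a latent code for a phoneme, then decode it to a curve."""
    return decode(np.eye(n_phonemes)[phoneme_id] @ mapping)


predicted = predict_curve(2)
reconstruction_error = np.abs(decode(encode(curves)) - curves).max()
```

Because every prediction passes through `decode`, the output is always a combination of the decoder's basis curves. In this toy setting that is what keeps the predicted contour smooth without a separate smoothing step, and editing the latent code before decoding changes the shape directly.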