Data Augmentation Methods on Ultrasound Tongue Images for Articulation-to-Speech Synthesis

12th ISCA Speech Synthesis Workshop (SSW2023). Pub Date: 2023-08-26. DOI: 10.21437/ssw.2023-36

I. Ibrahimov, G. Gosztolya, T. Csapó


Citations: 0

Abstract



Articulation-to-Speech Synthesis (ATS) focuses on converting articulatory biosignal information into audible speech, nowadays mostly using deep neural networks (DNNs), with a future target application of a Silent Speech Interface. Ultrasound Tongue Imaging (UTI) is an affordable and non-invasive technique that has become popular for collecting articulatory data. Data augmentation has been shown to improve the generalization ability of DNNs, e.g. to avoid overfitting, introduce variations into the existing dataset, or make the network more robust against various noise types on the input data. In this paper, we compare six different data augmentation methods on the UltraSuite-TaL corpus for UTI-based ATS using convolutional neural networks (CNNs). Validation mean squared error is used to evaluate the performance of the CNNs, while the performance of direct ATS is measured on the synthesized speech samples using Mel-Cepstral Distortion (MCD) and Perceptual Evaluation of Speech Quality (PESQ) scores. Although we did not find large differences in the outcome of the various data augmentation techniques, the results of this study suggest that while applying data augmentation to UTI poses some challenges due to the unique nature of the data, it provides benefits in terms of enhancing the robustness of neural networks. In general, articulatory control might be beneficial in TTS as well.
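
To make the evaluation terms in the abstract concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. It shows the validation mean squared error, the commonly used frame-averaged MCD formula (with the 0th, energy-like coefficient skipped, which is a common convention and an assumption here), and a simple additive Gaussian noise augmentation. The abstract does not name the six augmentation methods compared, so the noise function is only an illustration of the "noise on the input data" idea; all function names, the `sigma` value, and the assumption that images are scaled to [0, 1] are hypothetical.

```python
import numpy as np


def validation_mse(predicted, target):
    """Mean squared error between predicted and target spectral features."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((predicted - target) ** 2))


def mel_cepstral_distortion(ref_mcep, syn_mcep, skip_energy=True):
    """Frame-averaged MCD in dB for two (frames, coefficients) arrays.

    Assumes the two sequences are already time-aligned and equally long.
    """
    ref = np.asarray(ref_mcep, dtype=float)
    syn = np.asarray(syn_mcep, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("reference and synthesized features must have the same shape")
    if skip_energy:
        # Drop the 0th coefficient (assumed convention, not stated in the abstract).
        ref, syn = ref[:, 1:], syn[:, 1:]
    squared_diff = np.sum((ref - syn) ** 2, axis=1)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * squared_diff)
    return float(np.mean(per_frame))


def add_gaussian_noise(image, sigma=0.05, rng=None):
    """Illustrative augmentation: add zero-mean Gaussian noise to an image in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    image = np.asarray(image, dtype=float)
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    # Clip so the augmented frame stays in the assumed [0, 1] intensity range.
    return np.clip(noisy, 0.0, 1.0)
```

In such a setup the augmentation would be applied only to training frames, while the validation MSE and MCD are computed on unmodified data so that scores remain comparable across augmentation methods. PESQ is defined by a standardized algorithm and is normally computed with an existing implementation rather than re-derived.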

