Kishore Kumar Botsa, Lithin Reddy Marla, S. Gangashetty
2021 Thirteenth International Conference on Contemporary Computing (IC3-2021). Published 2021-08-05. DOI: 10.1145/3474124.3474163. https://doi.org/10.1145/3474124.3474163
A Generative Adversarial Network based Training Framework for Robust TTS in Noisy Environment
Humans counter the degrading effect of background noise on speech by dynamically changing their vocal characteristics. Text-to-speech (TTS) systems trained on clean speech cannot do this, so their intelligibility degrades in noise. Improving the intelligibility of such systems normally requires a large number of speech samples, which are difficult to collect across varied conditions such as noise backgrounds and signal-to-noise ratios (SNRs). This paper presents a noise-dependent enhancement to TTS, based on a generative adversarial network training framework, that generates intelligible speech in noise. The learning mechanism for the synthesizer network is inspired by the acoustic feedback humans use to counteract the effect of various background noises. The trained system is evaluated under a cafeteria noise condition with two objective measures, which indicate improved intelligibility compared to models trained on clean speech across three SNRs. The proposed modification does not require any additional training data and can be applied to a variety of deep neural networks that are trained with the back-propagation algorithm.
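Evaluating at a fixed signal-to-noise ratio means scaling the noise so that the ratio of speech power to noise power matches a target value in decibels. The following is a minimal sketch of that mixing step only, not the authors' pipeline: the function names, the synthetic sine "speech", the Gaussian "noise", and the SNR values -5, 0, and 5 dB are all illustrative assumptions, since the abstract does not state which three SNRs were used.

```python
import math
import random


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so that speech-to-noise power ratio equals target_snr_db.

    Returns (mixture, scaled_noise). Hypothetical helper for illustration.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Repeat the noise if it is shorter than the speech, then trim to length.
    reps = -(-len(speech) // len(noise))  # ceiling division
    noise = (list(noise) * reps)[: len(speech)]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)); solve for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (target_snr_db / 10)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measure_snr_db(speech, noise):
    """SNR in dB between a speech signal and an (already scaled) noise signal."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


# Synthetic stand-ins: a 220 Hz tone for speech, Gaussian samples for noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]

# Assumed SNR values; the abstract only says "3 SNRs".
for target in (-5.0, 0.0, 5.0):
    mixture, scaled = mix_at_snr(speech, noise, target)
    print(target, round(measure_snr_db(speech, scaled), 6))
```

In practice the speech and noise would be loaded from recordings at a common sampling rate, and the mixture would be passed to whichever objective intelligibility measure is being used.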