Bao Pang, Jun Teng, Qingyang Xu, Yong Song, Xianfeng Yuan, Yibin Li
{"title":"Chinese personalised text-to-speech synthesis for robot human–machine interaction","authors":"Bao Pang, Jun Teng, Qingyang Xu, Yong Song, Xianfeng Yuan, Yibin Li","doi":"10.1049/csy2.12098","DOIUrl":null,"url":null,"abstract":"<p>Speech interaction is an important means of robot interaction. With the rapid development of deep learning, end-to-end speech synthesis methods based on this technique have gradually become mainstream. Chinese deep learning-based speech synthesis techniques suffer from problems such as unstable synthesised speech, poor naturalness and poor personalised speech synthesis, which do not satisfy some practical application scenarios. Hence, an F-MelGAN model is adopted to improve the performance of Chinese speech synthesis. A post-processing network is used to refine the Mel-spectrum predicted by the decoder and alleviate the Mel-spectrum distortion phenomenon. A phoneme-level and sentence-level combined module is proposed to model the personalised style of speakers. A combination of an acoustic conditioning network, speaker encoder network GCNet and feedback-constrained training is proposed to solve the problem of poor personalised speech synthesis and achieve personalised speech customisation in Chinese. Experimental results show that the whole model can generate high-quality speech with high speaker similarity for both speakers that appear in the training process and speakers that never appear in the training process.</p>","PeriodicalId":34110,"journal":{"name":"IET Cybersystems and Robotics","volume":null,"pages":null},"PeriodicalIF":1.5000,"publicationDate":"2023-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/csy2.12098","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Cybersystems and Robotics","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/csy2.12098","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Abstract
Speech interaction is an important means of robot interaction. With the rapid development of deep learning, end-to-end speech synthesis methods based on this technique have gradually become mainstream. However, Chinese deep-learning-based speech synthesis still suffers from unstable synthesised speech, poor naturalness and weak personalisation, which make it unsuitable for some practical application scenarios. Hence, an F-MelGAN model is adopted to improve the performance of Chinese speech synthesis. A post-processing network refines the Mel-spectrum predicted by the decoder and alleviates Mel-spectrum distortion. A combined phoneme-level and sentence-level module is proposed to model the personalised style of speakers. A combination of an acoustic conditioning network, the speaker encoder network GCNet and feedback-constrained training is proposed to address the weak personalisation of synthesised speech and to achieve personalised speech customisation in Chinese. Experimental results show that the whole model can generate high-quality speech with high speaker similarity for both speakers seen during training and speakers never seen during training.
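To make the abstract's two architectural ideas concrete, the sketch below illustrates, in generic PyTorch, (a) a post-processing network that refines a decoder-predicted Mel-spectrum with a learned residual correction, and (b) a speaker encoder that maps a reference utterance to a fixed-length embedding for personalised conditioning. This is a minimal illustration of the general techniques, not the paper's F-MelGAN or GCNet implementation; the module names, layer choices and hyper-parameters are assumptions.

```python
# Minimal sketch (assumed architecture, not the authors' implementation) of
# Mel-spectrum post-processing and speaker-embedding extraction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PostNet(nn.Module):
    """Residual 1-D convolution stack that predicts a correction to a coarse Mel-spectrum."""

    def __init__(self, n_mels: int = 80, channels: int = 256, n_layers: int = 5):
        super().__init__()
        layers, in_ch = [], n_mels
        for i in range(n_layers):
            out_ch = n_mels if i == n_layers - 1 else channels
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2),
                       nn.BatchNorm1d(out_ch)]
            if i < n_layers - 1:
                layers.append(nn.Tanh())
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, coarse_mel: torch.Tensor) -> torch.Tensor:
        # coarse_mel: (batch, n_mels, frames). The refined Mel-spectrum is the
        # decoder output plus a learned residual correction.
        return coarse_mel + self.net(coarse_mel)


class SpeakerEncoder(nn.Module):
    """Maps a reference utterance (Mel frames) to a fixed-length speaker embedding."""

    def __init__(self, n_mels: int = 80, emb_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:
        # ref_mel: (batch, frames, n_mels). The final hidden state, L2-normalised,
        # serves as the speaker embedding used to condition synthesis.
        _, h = self.rnn(ref_mel)
        return F.normalize(h[-1], dim=-1)


if __name__ == "__main__":
    coarse = torch.randn(2, 80, 120)                      # decoder output: 2 utterances, 120 frames
    refined = PostNet()(coarse)                           # post-processing refinement
    spk_emb = SpeakerEncoder()(torch.randn(2, 200, 80))   # embedding from a reference clip
    print(refined.shape, spk_emb.shape)                   # (2, 80, 120) and (2, 256)
```

In practice, such a speaker embedding would be broadcast along the time axis and concatenated with (or added to) the synthesiser's intermediate features, so that speakers unseen during training can still be imitated from a short reference recording.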