Better human computer interaction by enhancing the quality of text-to-speech synthesis
V. R. Reddy, K. S. Rao
2012 4th International Conference on Intelligent Human Computer Interaction (IHCI), published 2012-12-01
DOI: 10.1109/IHCI.2012.6481857 (https://doi.org/10.1109/IHCI.2012.6481857)
This paper proposes high-quality prosody models to improve the quality of text-to-speech (TTS) synthesis and thereby provide better human-computer interaction. Here, prosody refers to the duration and intonation patterns of a sequence of syllables. The prosody models are developed using feedforward neural networks, which predict prosodic information from the linguistic and production constraints of syllables. The prediction accuracy of the proposed neural-network-based models is compared objectively with that of the Classification and Regression Tree (CART) based prosody models used by Festival. Subjective listening tests are also performed to evaluate the quality of the speech synthesized with the predicted prosodic features. The evaluations show that the neural network models predict prosody more accurately than the CART-based models.
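
To make the modeling idea concrete, the following is a minimal sketch of a feedforward network that predicts a syllable-level prosodic value (here, a normalized duration) from a few input features, followed by objective scoring of the predictions. Everything specific in it is an assumption for illustration and is not taken from the paper: the three features (`position`, `n_phones`, `stressed`), the synthetic data, the single hidden layer of 8 tanh units, the training settings, and the choice of mean absolute error, RMSE, and correlation as objective measures.

```python
import numpy as np


def train_ffnn(X, y, hidden=8, epochs=3000, lr=0.1, seed=0):
    """Fit a one-hidden-layer feedforward regressor by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n = len(X)
    W1 = rng.normal(0.0, 0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    y = y.reshape(-1, 1)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)            # hidden-layer activations
        err = (h @ W2 + b2) - y             # prediction error per syllable
        dh = (err @ W2.T) * (1.0 - h ** 2)  # error backpropagated through tanh
        # Gradients of 0.5 * mean squared error.
        W2 -= lr * (h.T @ err) / n
        b2 -= lr * err.mean(axis=0)
        W1 -= lr * (X.T @ dh) / n
        b1 -= lr * dh.mean(axis=0)
    return W1, b1, W2, b2


def predict(params, X):
    """Run the trained network forward and return a 1-D array of predictions."""
    W1, b1, W2, b2 = params
    return (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()


def objective_scores(actual, predicted):
    """Objective measures of agreement between actual and predicted values."""
    err = predicted - actual
    return {
        "mean_abs_error": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "correlation": float(np.corrcoef(actual, predicted)[0, 1]),
    }


def demo():
    """Train on synthetic syllables and score held-out predictions."""
    rng = np.random.default_rng(42)
    n = 300
    # Hypothetical features, all invented for this sketch.
    position = rng.uniform(0.0, 1.0, n)              # relative position in the phrase
    n_phones = rng.uniform(0.0, 1.0, n)              # scaled phone count
    stressed = rng.integers(0, 2, n).astype(float)   # 1.0 if the syllable is stressed
    X = np.column_stack([position, n_phones, stressed])
    # Synthetic "duration" in arbitrary normalized units, not real speech data.
    y = 0.3 + 0.4 * position + 0.2 * n_phones * stressed + rng.normal(0.0, 0.01, n)

    X_train, X_test = X[:200], X[200:]
    y_train, y_test = y[:200], y[200:]
    # Standardize inputs using training-set statistics only.
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    params = train_ffnn((X_train - mean) / std, y_train)
    return objective_scores(y_test, predict(params, (X_test - mean) / std))


if __name__ == "__main__":
    print(demo())
```

The same `objective_scores` function could be applied to the predictions of any other model, such as a regression tree, so that two prosody models are compared on identical held-out syllables.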