Authors: Sin-Horng Chen, Shaw-Hwa Hwang, Chun-Yu Tsai
Published in: [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing
Publication date: 1992-03-23
DOI: 10.1109/ICASSP.1992.226124 (https://doi.org/10.1109/ICASSP.1992.226124)
A first study on neural net based generation of prosodic and spectral information for Mandarin text-to-speech
A neural-network-based approach to generating prosodic and spectral information of syllables for Mandarin text-to-speech synthesis is studied. Contextual features are first extracted from the input text by text analysis and used as input signals for synthesis. Six multilayer perceptrons are then employed to generate pause duration, syllable duration, and pitch mean and shape of one- and two-syllable synthesis units. Several reproduction templates of proper size are first generated for each synthesis unit. The objective is to generate spectral patterns of the syllable that can be directly concatenated to synthesize natural speech without further modification. The validity of the approach was examined by simulation using a database of sentential utterances recorded from TV news read by a single female announcer. Experimental results confirmed that this is a promising approach for Mandarin text-to-speech synthesis.