A first study on neural net based generation of prosodic and spectral information for Mandarin text-to-speech

[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing · Pub Date: 1992-03-23 · DOI: 10.1109/ICASSP.1992.226124

Sin-Horng Chen, Shaw-Hwa Hwang, Chun-Yu Tsai

Citations: 16

Abstract

A neural-network-based approach to generating prosodic and spectral information of syllables for Mandarin text-to-speech synthesis is studied. First, contextual features are extracted from the input text by text analysis and used as input signals for synthesis. Then, six multilayer perceptrons are employed to generate the pause duration, syllable duration, and pitch mean and shape of one- and two-syllable synthesis units. For each synthesis unit of the syllable approach, several reproduction templates of proper size are first generated. The objective is to generate syllable spectral patterns that can be concatenated directly to synthesize natural speech without further modification. The validity of this approach was examined by simulation on a database of sentence utterances recorded from TV news read by a single female announcer. Experimental results confirmed that this is a promising approach for Mandarin text-to-speech synthesis.
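The abstract describes contextual features, obtained by text analysis, being fed to multilayer perceptrons that output pause duration, syllable duration, and pitch mean and shape. The Python sketch below illustrates only that data flow, under loudly stated assumptions: the feature encoding (`encode_context`, built from neighbouring tones plus relative sentence position), the hidden-layer size, the number of pitch-shape coefficients, and every function name are hypothetical; the weights are random and untrained; and because the abstract does not say how its six networks are divided among the targets, the sketch simply uses one network per listed target. It is not the authors' implementation.

```python
import math
import random

N_TONES = 5  # Mandarin lexical tones 1-4 plus the neutral tone, coded here as 5


def one_hot(index, size):
    """Return a list of zeros with a 1.0 at position index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_context(prev_tone, cur_tone, next_tone, position, sentence_length):
    """Hypothetical contextual features: one-hot tones of the previous, current and
    next syllable, plus the syllable's relative position in the sentence (0.0 to 1.0)."""
    features = []
    for tone in (prev_tone, cur_tone, next_tone):
        if not 1 <= tone <= N_TONES:
            raise ValueError(f"tone must be in 1..{N_TONES}, got {tone}")
        features += one_hot(tone - 1, N_TONES)
    features.append(position / max(sentence_length - 1, 1))
    return features


def make_mlp(n_in, n_hidden, n_out, seed):
    """Random, untrained weights for a perceptron with one hidden layer."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    return w1, [0.0] * n_hidden, w2, [0.0] * n_out


def forward(mlp, x):
    """Forward pass: sigmoid hidden layer, linear output layer."""
    w1, b1, w2, b2 = mlp
    hidden = [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + b)))
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


# One MLP per prosodic target (hypothetical split); the output sizes, e.g. three
# pitch-shape coefficients, and the hidden size of 8 are illustrative assumptions.
TARGETS = {"pause_duration": 1, "syllable_duration": 1, "pitch_mean": 1, "pitch_shape": 3}
N_FEATURES = 3 * N_TONES + 1
MLPS = {name: make_mlp(N_FEATURES, 8, size, seed=i)
        for i, (name, size) in enumerate(TARGETS.items())}


def predict_prosody(features):
    """Run every target network on the same contextual feature vector."""
    return {name: forward(mlp, features) for name, mlp in MLPS.items()}


if __name__ == "__main__":
    x = encode_context(prev_tone=3, cur_tone=2, next_tone=4, position=1, sentence_length=5)
    for name, values in predict_prosody(x).items():
        print(name, [round(v, 3) for v in values])
```

Sharing one feature vector across separate per-target networks mirrors the abstract's use of several perceptrons rather than a single multi-output network. In a working system each network would presumably be trained on prosodic parameters measured from the recorded speech database; here the untrained outputs only demonstrate the shape of the mapping.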

Source journal

[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing

Self-citation rate

0.00%
