{"title":"文本到语音合成中的问题","authors":"M. Macchi","doi":"10.1109/IJSIS.1998.685467","DOIUrl":null,"url":null,"abstract":"The ultimate goal of text-to-speech synthesis is to convert ordinary orthographic text into an acoustic signal that is indistinguishable from human speech. Originally, synthesis systems were architected around a system of rules and models that were based on research on human language and speech production and perception processes. The quality of speech produced by such systems is inherently limited by the quality of the rules and the models. Given that our knowledge of human speech processes is still incomplete, the quality of text-to-speech is far from natural-sounding. Hence, today's interest in high quality speech for applications, in combination with advances in computer resource, has caused the focus to shift from rules and model-based methods to corpus-based methods that presumably bypass rules and models. For example, many systems now rely on large word pronunciation dictionaries instead of letter-to-phoneme rules and large prerecorded sound inventories instead of rules predicting the acoustic correlates of phonemes. Because of the need to analyze large amounts of data, this approach relies on automated techniques such as those used in automatic speech recognition.","PeriodicalId":289764,"journal":{"name":"Proceedings. IEEE International Joint Symposia on Intelligence and Systems (Cat. No.98EX174)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1998-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":"{\"title\":\"Issues in text-to-speech synthesis\",\"authors\":\"M. Macchi\",\"doi\":\"10.1109/IJSIS.1998.685467\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The ultimate goal of text-to-speech synthesis is to convert ordinary orthographic text into an acoustic signal that is indistinguishable from human speech. Originally, synthesis systems were architected around a system of rules and models that were based on research on human language and speech production and perception processes. The quality of speech produced by such systems is inherently limited by the quality of the rules and the models. Given that our knowledge of human speech processes is still incomplete, the quality of text-to-speech is far from natural-sounding. Hence, today's interest in high quality speech for applications, in combination with advances in computer resource, has caused the focus to shift from rules and model-based methods to corpus-based methods that presumably bypass rules and models. For example, many systems now rely on large word pronunciation dictionaries instead of letter-to-phoneme rules and large prerecorded sound inventories instead of rules predicting the acoustic correlates of phonemes. Because of the need to analyze large amounts of data, this approach relies on automated techniques such as those used in automatic speech recognition.\",\"PeriodicalId\":289764,\"journal\":{\"name\":\"Proceedings. IEEE International Joint Symposia on Intelligence and Systems (Cat. No.98EX174)\",\"volume\":\"15 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1998-03-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"25\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings. IEEE International Joint Symposia on Intelligence and Systems (Cat. 
No.98EX174)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IJSIS.1998.685467\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. IEEE International Joint Symposia on Intelligence and Systems (Cat. No.98EX174)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IJSIS.1998.685467","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The ultimate goal of text-to-speech synthesis is to convert ordinary orthographic text into an acoustic signal that is indistinguishable from human speech. Originally, synthesis systems were architected around a system of rules and models based on research into human language and into speech production and perception processes. The quality of speech produced by such systems is inherently limited by the quality of the rules and the models. Given that our knowledge of human speech processes is still incomplete, the quality of text-to-speech is far from natural-sounding. Hence, today's interest in high-quality speech for applications, in combination with advances in computing resources, has caused the focus to shift from rule- and model-based methods to corpus-based methods that presumably bypass rules and models. For example, many systems now rely on large word pronunciation dictionaries instead of letter-to-phoneme rules, and on large prerecorded sound inventories instead of rules predicting the acoustic correlates of phonemes. Because of the need to analyze large amounts of data, this approach relies on automated techniques such as those used in automatic speech recognition.
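To make the contrast between the two pronunciation strategies concrete, here is a minimal sketch, not taken from the paper or any particular system: a corpus-style word pronunciation dictionary consulted first, with a deliberately crude letter-to-phoneme rule table as the fallback. All names, entries, and phoneme symbols are illustrative assumptions.

```python
# Illustrative only: dictionary-based lookup (corpus approach) with a naive
# letter-to-phoneme fallback (rule approach). Real rule systems are
# context-sensitive (e.g. "ph" -> F, silent final "e"); this is not.

PRONUNCIATION_DICT = {
    "speech": ["S", "P", "IY", "CH"],
    "text": ["T", "EH", "K", "S", "T"],
}

# One-letter-to-one-phoneme rules, purely for demonstration.
LETTER_TO_PHONEME = {
    "a": "AE", "e": "EH", "i": "IH", "o": "AA", "u": "AH",
    "b": "B", "c": "K", "d": "D", "f": "F", "g": "G", "h": "HH",
    "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N", "p": "P",
    "q": "K", "r": "R", "s": "S", "t": "T", "v": "V", "w": "W",
    "x": "K S", "y": "Y", "z": "Z",
}

def to_phonemes(word: str) -> list[str]:
    """Look the word up in the dictionary; fall back to letter rules."""
    word = word.lower()
    if word in PRONUNCIATION_DICT:
        return PRONUNCIATION_DICT[word]
    phonemes = []
    for ch in word:
        if ch.isalpha():
            phonemes.extend(LETTER_TO_PHONEME[ch].split())
    return phonemes

if __name__ == "__main__":
    for w in ("speech", "synthesis"):
        print(w, "->", " ".join(to_phonemes(w)))
```

In-vocabulary words ("speech") get their stored transcription, while out-of-vocabulary words ("synthesis") fall through to the rules, which is why corpus-based systems still keep some rule or model component as a safety net.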