{"title":"印度语言多语种语音合成器的性能评价与比较","authors":"M. Jeeva, B. Ramani, P. Vijayalakshmi","doi":"10.1109/ICRTIT.2013.6844268","DOIUrl":null,"url":null,"abstract":"Given an input text, a text-to-speech (TTS) system is expected to produce a speech signal that is highly intelligible to human listener. State-of-the art synthesis approaches are: unit selection-based concatenative speech synthesis (USS) and hidden Markov model (HMM)-based speech synthesis (HTS). In USS approach, pre-recorded speech units are selected according to the given text and concatenated to produce synthetic speech whereas in HTS approach, features are extracted from the speech units and the context dependent HMMs are trained for these units. These models are concatenated to form sentence HMMs, which synthesize speech for the given text, by extracting features from them and passing it through corresponding source-system filters. For Indian languages, building a speech synthesizer for each language is laborious. In this work, monolingual and multilingual speech synthesizers are developed in the state-of-the-art approaches and the performances are compared for both the systems. Based on the acoustic similarities across Indian languages, a common phoneset and a question set is derived for four of the Indian languages namely, Tamil, Telugu, Malayalam, and Hindi. The performance of the synthesizers developed are evaluated using mean opinion score (MOS) derived from the listeners. The average MOS ranges from 2.57 to 3.88 for the monolingual and multilingual systems.","PeriodicalId":113531,"journal":{"name":"2013 International Conference on Recent Trends in Information Technology (ICRTIT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Performance evaluation and comparison of multilingual speech synthesizers for Indian languages\",\"authors\":\"M. Jeeva, B. Ramani, P. Vijayalakshmi\",\"doi\":\"10.1109/ICRTIT.2013.6844268\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Given an input text, a text-to-speech (TTS) system is expected to produce a speech signal that is highly intelligible to human listener. State-of-the art synthesis approaches are: unit selection-based concatenative speech synthesis (USS) and hidden Markov model (HMM)-based speech synthesis (HTS). In USS approach, pre-recorded speech units are selected according to the given text and concatenated to produce synthetic speech whereas in HTS approach, features are extracted from the speech units and the context dependent HMMs are trained for these units. These models are concatenated to form sentence HMMs, which synthesize speech for the given text, by extracting features from them and passing it through corresponding source-system filters. For Indian languages, building a speech synthesizer for each language is laborious. In this work, monolingual and multilingual speech synthesizers are developed in the state-of-the-art approaches and the performances are compared for both the systems. Based on the acoustic similarities across Indian languages, a common phoneset and a question set is derived for four of the Indian languages namely, Tamil, Telugu, Malayalam, and Hindi. The performance of the synthesizers developed are evaluated using mean opinion score (MOS) derived from the listeners. 
The average MOS ranges from 2.57 to 3.88 for the monolingual and multilingual systems.\",\"PeriodicalId\":113531,\"journal\":{\"name\":\"2013 International Conference on Recent Trends in Information Technology (ICRTIT)\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-07-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 International Conference on Recent Trends in Information Technology (ICRTIT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICRTIT.2013.6844268\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 International Conference on Recent Trends in Information Technology (ICRTIT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICRTIT.2013.6844268","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Given an input text, a text-to-speech (TTS) system is expected to produce a speech signal that is highly intelligible to a human listener. The state-of-the-art synthesis approaches are unit selection-based concatenative speech synthesis (USS) and hidden Markov model (HMM)-based speech synthesis (HTS). In the USS approach, pre-recorded speech units are selected according to the given text and concatenated to produce synthetic speech, whereas in the HTS approach, features are extracted from the speech units and context-dependent HMMs are trained for these units. The trained models are concatenated to form sentence HMMs, which synthesize speech for the given text by generating features from the models and passing them through the corresponding source-system filters. For Indian languages, building a separate speech synthesizer for each language is laborious. In this work, monolingual and multilingual speech synthesizers are developed using both state-of-the-art approaches, and the performance of the two systems is compared. Based on the acoustic similarities across Indian languages, a common phoneset and question set are derived for four Indian languages, namely Tamil, Telugu, Malayalam, and Hindi. The performance of the developed synthesizers is evaluated using mean opinion scores (MOS) obtained from listeners. The average MOS ranges from 2.57 to 3.88 across the monolingual and multilingual systems.
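
The USS approach summarized above selects pre-recorded units for the target text and then concatenates them. The following Python sketch is illustrative only: it shows a toy dynamic-programming unit selection over a hypothetical candidate inventory with made-up target and join costs, not the authors' implementation.

```python
# Illustrative sketch only: toy unit selection for the USS idea
# ("select pre-recorded units, then concatenate"). All unit ids and
# costs below are hypothetical, not from the paper's system.

# Hypothetical candidates: for each target phone, (unit_id, target_cost) pairs.
candidates = {
    "t": [("t_01", 0.2), ("t_07", 0.5)],
    "a": [("a_03", 0.1), ("a_09", 0.4)],
    "m": [("m_02", 0.3), ("m_05", 0.2)],
}

def join_cost(u, v):
    """Toy concatenation cost: penalize joining units cut from distant recordings."""
    return 0.05 * abs(int(u.split("_")[1]) - int(v.split("_")[1]))

def select_units(phone_seq):
    """Choose one candidate per phone minimizing total target + join cost (Viterbi)."""
    # best[unit_id] = (cumulative cost, unit sequence ending in unit_id)
    best = {u: (tc, [u]) for u, tc in candidates[phone_seq[0]]}
    for phone in phone_seq[1:]:
        new_best = {}
        for u, tc in candidates[phone]:
            cost, path = min(
                (prev_cost + join_cost(prev_u, u) + tc, prev_path)
                for prev_u, (prev_cost, prev_path) in best.items()
            )
            new_best[u] = (cost, path + [u])
        best = new_best
    return min(best.values())  # lowest-cost (total cost, unit sequence)

total_cost, units = select_units(["t", "a", "m"])
print(units, round(total_cost, 3))  # the selected units would then be concatenated
```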
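For the HTS side, the abstract mentions passing generated features through source-system filters. A minimal sketch of that final synthesis step follows, assuming an impulse-train excitation and an arbitrary stable all-pole filter; the pitch value and filter coefficients are made up for illustration and are not the paper's models.

```python
# Illustrative sketch only: source-system (excitation-filter) synthesis step.
import numpy as np
from scipy.signal import lfilter

fs = 16000                      # sampling rate in Hz (assumed)
pitch_hz = 120                  # made-up fundamental frequency
n = fs // 2                     # half a second of samples

# Source: simple impulse train at the pitch period (voiced excitation).
excitation = np.zeros(n)
excitation[:: fs // pitch_hz] = 1.0

# System: an arbitrary stable all-pole filter standing in for the vocal tract.
a = [1.0, -1.3, 0.9]            # AR (denominator) coefficients, illustrative
speech = lfilter([1.0], a, excitation)
print(speech.shape, float(np.max(np.abs(speech))))
```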
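The evaluation reports average MOS values between 2.57 and 3.88. As a minimal sketch, a MOS is simply the arithmetic mean of listener ratings on a 1 to 5 opinion scale per system; the scores below are made up and are not the paper's listening-test data.

```python
# Illustrative sketch only: computing a mean opinion score (MOS) per system
# from listener ratings (1 = bad ... 5 = excellent). Ratings are fabricated
# for illustration.
from statistics import mean

ratings = {
    "monolingual_system":  [3, 4, 3, 4, 3],
    "multilingual_system": [4, 4, 3, 4, 4],
}

for system, scores in ratings.items():
    print(f"{system}: MOS = {mean(scores):.2f}")
```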