{"title":"A Three-Stage Text Normalization Strategy for Mandarin Text-to-Speech Systems","authors":"Tao Zhou, Yuan Dong, Dezhi Huang, Wu Liu, Haila Wang","doi":"10.1109/CHINSL.2008.ECP.43","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.43","url":null,"abstract":"Text normalization is an important component in mandarin Text-to-Speech system. This paper develops a taxonomy of Non-Standard Words (NSW's) based on a Large-scale Chinese corpus and proposes a three-stage text normalization strategy: Finite State Automata (FSA) for initial classification, Maximum Entropy (ME) Classifier & Rules for further classification and General Rules for standard word conversion. The three-stage approach achieves Precision of 96.02% in experiments, 5.21% higher than that of simple rule based approach and 2.21% higher than that of simple machine learning method. Experiments results show that the approach of three-stage disambiguation strategy for text normalization makes considerable improvement, and works well in real TTS system.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115775538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Prosody Boundary Labeling of Mandarin Using Both Text and Acoustic Information","authors":"Chongjia Ni, Wenju Liu, Bo Xu","doi":"10.1109/CHINSL.2008.ECP.100","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.100","url":null,"abstract":"Prosody is an important factor for a high quality text-to- speech (TTS) system. Prosody is often described with a hierarchical structure. So the generation of the hierarchical prosody structure is very important both in the corpus building and the real-time text analysis, but the prosody labeling procedure is laborious and time consuming. In this paper, an automatic prosody boundary label system is presented, in which the classification and regression tree (CART) framework is used. In this system, we build a prosody model using acoustic information and the text information based on large speech corpus with prosodic structure label (ASCCD). Experiments show this model can achieve prosody boundary detection 90.86% accuracy.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124274445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discriminative Output Coding Features for Speech Recognition","authors":"O. Dehzangi, B. Ma, Chng Eng Siong, Haizhou Li","doi":"10.1109/CHINSL.2008.ECP.34","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.34","url":null,"abstract":"This paper presents a novel approach of discriminative acoustic feature extraction for speech recognition using output coding technique. A high dimensional feature space for higher discriminative capability is constructed by expanding MFCC coefficients with polynomial expansion. In order to fit the discriminative features in the hidden Markov model structure of speech recognition, the high dimensional feature vectors are further projected into a low dimensional feature space using the output scores of a set of SVMs. Each of the SVMs is trained in one phone versus the rest manner so that each of the resulting feature dimensions can provide effective information to differ one phone from the others. The discriminative features have been evaluated in the speech recognition task of the TIMIT corpus, and 72.18% phone accuracy has been achieved.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121216925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation and Analysis of Minimum Phone Error Training and its Modified Versions for Large Vocabulary Mandarin Speech Recognition","authors":"Yung-Jen Cheng, Che-Kuang Lin, Lin-Shan Lee","doi":"10.1109/CHINSL.2008.ECP.51","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.51","url":null,"abstract":"This paper reports a detailed study on minimum phone error (MPE), minimum phone frame error (MPFE), and a physical-state level version of minimum Bayes risk (sMBR) training, as well as several modified versions of them, for transcription of large vocabulary Mandarin broadcast news. We found the results are quite different from these observed previously for English and Arabic broadcast news tasks[l], in particular the trends are different when different performance measures (word and character accuracies) are used. This makes the difference for Chinese language, for which character accuracy is usually more important, while word accuracy is commonly used for other languages. Modifications to these approaches tested here include considering the variable phone length and applying penalties to erroneous frames. They were shown to be able to significantly improve character accuracy in our experiments.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122850852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simultaneous Acoustic, Prosodic, and Phrasing Model Training for TTs Conversion Systems","authors":"Keiichiro Oura, Yoshihiko Nankaku, T. Toda, K. Tokuda, R. Maia, S. Sakai, Satoshi Nakamura","doi":"10.1109/CHINSL.2008.ECP.12","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.12","url":null,"abstract":"A new integrated model for simultaneous modeling of linguistic and acoustic models, and a training algorithm is proposed. Usually, text-to-speech (TTS) systems based on the hidden Markov model (HMM) consist of text analysis and speech synthesis modules. Linguistic and acoustic model training are performed independently using different training data sets. Integrated model parameters were simultaneously optimized by the proposed training algorithm. The derived algorithm optimizes two model parameter sets simultaneously. Therefore, the appropriate model is estimated because we can directly-formulate the TTS problem in which the speech waveform is generated from a word sequence. We conducted objective evaluation experiments using phrasing and prosodic models to evaluate the effectiveness of the proposed technique.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125993004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mandarin Language Understanding in Dialogue Context","authors":"Yushi Xu, Jingjing Liu, S. Seneff","doi":"10.1109/CHINSL.2008.ECP.40","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.40","url":null,"abstract":"In this paper we introduce Mandarin language understanding methods developed for spoken language applications. We describe a set of strategies to improve the parsing performance for Mandarin. We also discuss two context resolution techniques adopted to handle Chinese ellipsis in a practical Mandarin spoken dialogue system. Experimental evaluation verifies the effectiveness and efficiency of our proposed parsing enhancements, in terms of both parse coverage and speed. System evaluation with human subjects also verifies the effectiveness of our proposed approaches to speech understanding and context resolution in practical conversational systems.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130701635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Position Information for Language Modeling in Speech Recognition","authors":"Hsuan-Sheng Chiu, Guan-Yu Chen, Chun-Jen Lee, Berlin Chen","doi":"10.1109/CHINSL.2008.ECP.37","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.37","url":null,"abstract":"This paper considers word position information for language modeling. For organized documents, such as technical papers or news reports, the composition and the word usage of articles of the same style are usually similar. Therefore, the documents can be separated into partitions consisting of identical rhetoric or topic styles by the literary structures, e.g., introductory remarks, related studies or events, elucidations of methodology or affairs, conclusions of the articles, and references, or footnotes of reporters. In this paper, we explore word position information and then propose two position- dependent language models for speech recognition. The structures and characteristics of these position-dependent language models were extensively investigated, while its performance was analyzed and verified by comparing it with the existing n-gram, mixture- and topic-based language models. The large vocabulary continuous speech recognition (LVCSR) experiments were conducted on the broadcast news transcription task. 
The preliminary results seem to indicate that the proposed position-dependent models are comparable to the mixture- and topic-based models.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129932015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Double Gauss Based Unsupervised Score Normalization in Speaker Verification","authors":"Wu Guo, Lirong Dai, Ren-Hua Wang","doi":"10.1109/CHINSL.2008.ECP.53","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.53","url":null,"abstract":"In text-independent speaker verification, unsupervised mode can improve system performance. In traditional systems, the speaker model is updated when a test speech has a score higher than a particular threshold; we call this unsupervised model training. In this paper, an unsupervised score normalization is proposed. A target speaker score Gauss and an impostor score Gauss are set up as a prior; the parameters of the impostor score model are updated using the test score. Then the test score is normalized by the new impostor score model. When the unsupervised score normalization, unsupervised model training and factor analysis are adopted in the NIST 2006 SRE core test, the EER of the system is 4.29%.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127740100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prosodic Variation in Cantonese-English Code-Mixed Speech","authors":"Wentao Gu, Tan Lee, P. Ching","doi":"10.1109/CHINSL.2008.ECP.97","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.97","url":null,"abstract":"This study investigates the prosodic features of Cantonese-English code-mixed speech. It is found that the prosody of the matrix language is hardly altered, while the prosody of the embedded language is assimilated to that of the matrix language. That is, the rhythmic pattern is shifted towards syllable-timing, whereas the variations in the F0 pattern are mainly in the word-final syllable: for a stressed syllable the F0 contour turns flat, while for a post-tonic unstressed syllable the F0 contour falls more steeply than in monolingual English speech. Such F0 variations can be explained by the phonological interaction of English lexical stress and Cantonese lexical tone. In addition, the F0 of the embedded English word tends to become higher due to the embedding effect.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130158999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Utilization of Huge Written Text Corpora for Conversational Speech Recognition","authors":"Xinhui Hu, H. Yamamoto, Jin-Song Zhang, K. Yasuda, Youzheng Wu, H. Kashioka","doi":"10.1109/CHINSL.2008.ECP.36","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.36","url":null,"abstract":"In this paper, we propose a new sentence selection method using large written text corpora to augment the language model of conversational speech recognition in order to resolve the insufficiency of in-domain training data coverage in conversational speech recognition. In the proposed method, the large written text corpora are clustered by an entropy-based method. Clusters similar to the target development set are selected automatically. Next, utterances are selected and mixed with the original conversational training corpus, and language models for conversational speech recognition are built. In our experiments, a different speech style test set that is not covered by original conversational training data is used for evaluation. The perplexity of the test set was reduced from 249.6 to 210.8, and the word recognition accuracy was improved by approximately 5% by using our method. Index Terms: data collection, training data coverage, language model, conversational speech recognition.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124954956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}