{"title":"Evaluation of a Feature Compensation Approach Using High-Order Vector Taylor Series Approximation of an Explicit Distortion Model on Aurora2, Aurora3, and Aurora4 Tasks","authors":"Jun Du, Qiang Huo, Yu Hu","doi":"10.1109/CHINSL.2008.ECP.32","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.32","url":null,"abstract":"In our previous work, a new feature compensation approach to robust speech recognition was proposed by using high-order vector Taylor series (HOVTS) approximation of an explicit model of distortions caused by additive noises, and evaluation results were reported on Aurora2 database. This paper extends the above approach to deal with both additive noises and convolutional distortions, and reports evaluation results on Aurora2, Aurora3, and Aurora4 tasks.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"272 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124396528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Prosodic Strength Calculation Method for Prosody Reduction Modeling","authors":"Honglei Cong, Zhiyong Wu, Lianhong Cai, H. Meng","doi":"10.1109/CHINSL.2008.ECP.25","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.25","url":null,"abstract":"To improve the naturalness of synthetic speech, prosody models in text-to-speech (TTS) system should be able to describe different prosody variations in natural speech. In this paper, prosody variation patterns behind the partial reduction phenomena are analyzed. In order to model the prosody reduction effect and incorporate it into the prosody model for speech synthesis, prosodic strength is introduced and a new prosodic strength calculation method is proposed. The method aims to model the sentence planning of prosody reduction and is based on the concept that the objective of prosodic strength should complete the planned target of the speech unit. The approach on how to integrate prosodic strength into speech synthesis system is also introduced. Experiments show that the estimated prosodic strength values by the proposed method have good correlations with both prosody structure and acoustic features.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117340303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HMM-Based Mixed-Language (Mandarin-English) Speech Synthesis","authors":"Yao Qian, Houwei Cao, F. Soong","doi":"10.1109/CHINSL.2008.ECP.15","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.15","url":null,"abstract":"English words or short phrases embedded in Mandarin utterances have become more common among bilingually educated people like college students in China. Similarly, it becomes highly desirable that TTS systems can synthesize mixed-language speech properly. Recently, we proposed an HMM-based bilingual TTS to synthesize a target language when only monolingual source language recording from a speaker is available. In this paper, we extend it to synthesize mixed-language sentences. A cross-language state mapping is first established between decision trees built from the English and Mandarin recordings of a bilingual speaker. Via the mapping, English words or phrases embedded in Mandarin sentences can then be synthesized. The bilingual state-mapping is extended to monolingual speaker to perform mixed-language synthesis. Perceptual test results show: (1) decent intelligibility, confirmed by an English word transcription accuracy of 86%; (2) good speech quality with an average MOS score of 3.2.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116530268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Entropy-Based Analysis of the Prosodic Features of Chinese Dialects","authors":"Raymond W. M. Ng, Tan Lee","doi":"10.1109/CHINSL.2008.ECP.28","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.28","url":null,"abstract":"In this paper, a novel approach is proposed to analyze prosodic features of four Chinese dialects: Wu, Cantonese, Min and Mandarin. The ultimate goal is to exploit these features in the task of automatic spoken language identification. Two entropy-based evaluation metrics are formulated to address the problems of data sparseness and lack of speakers. Different prosody-related acoustic features and their combinations are evaluated. F0, F0 gradient and intensity are found to contain the most language-related information. Maximum language-related information are observed in multi-dimensional N-gram features with F0, F0 gradient and syllable position in sentence. There are also some uncertain results that reveal the limitations of the proposed metrics.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122651694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Non-Target Region Information for Confidence Measure Based on Bayesian Information Criterion","authors":"Cong Liu, Yu Hu, Xiong-Guo Lei, Zhiguo Wang, Lirong Dai, Ren-Hua Wang","doi":"10.1109/CHINSL.2008.ECP.69","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.69","url":null,"abstract":"In this paper appropriate confidence measures (CMs) are investigated for Mandarin command word recognition, both in the so-called target region and non-target region, respectively. Here the target region refers to the recognized speech part of command word while the non-target region refers to the recognized silence part. It shows that exploiting extra information in the non-target region can effectively complement the traditional CM which usually focus on the target region. Furthermore, when analyzing the non-target region in a more theoretical way, where Bayesian information criterion (BIC) is employed to locate more precise boundary in the non-target region, even more improvement is achieved. In two different Mandarin telephone command word tasks, more than 20% relative reduction of equal error rate (EER) is obtained.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126877412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Maximum Entropy Based Hierarchical Model for Automatic Prosodic Boundary Labeling in Mandarin","authors":"Fangzhou Liu, Huibin Jia, J. Tao","doi":"10.1109/CHINSL.2008.ECP.76","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.76","url":null,"abstract":"Modeling prosodic rhythm is of great importance for both speech synthesis and speech understanding, and it requires a large enough corpus with precise prosodic boundary labels. This paper proposes a maximum entropy (ME) based hierarchical model, which utilizes both text and acoustic features, to automatically label Mandarin prosodic boundaries. Results of comparative experiments show that, for the task of prosodic boundary detection, ME model obviously outperforms classification and regression tree (CART), and the bottom-up hierarchical framework is also significantly superior to the flat single-level framework.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120930367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Layer F0 Modeling for HMM-Based Speech Synthesis","authors":"Cheng-Cheng Wang, Zhenhua Ling, Bu-Fan Zhang, Lirong Dai","doi":"10.1109/CHINSL.2008.ECP.44","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.44","url":null,"abstract":"This paper proposes a two-layer fundamental frequency (F0) modeling method for HMM-based parametric speech synthesis. The F0 models are trained for each context-dependent phoneme in the conventional HMM-based speech synthesis system. Considering the super-segmental characteristics of F0 features, an explicit syllable-layer F0 model is introduced in this paper. At synthesis stage, the F0 contour is generated by maximizing the combined likelihood functions of the phone-layer and syllable-layer F0 models. The objective and subjective evaluation results in our experiments show that the proposed multi-layer F0 modeling method can improve the performance of F0 prediction for emotional speech synthesis.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114205274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tone Evaluation of Chinese Continuous Speech Based on Prosodic Words","authors":"Yi-Qian Pan, Si Wei, Ren-Hua Wang","doi":"10.1109/CHINSL.2008.ECP.77","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.77","url":null,"abstract":"Tonal evaluation of Chinese continuous speech plays an important role in Mandarin Chinese pronunciation test. In this paper, we introduce the Multi-Space Distribution Hidden Markov Model based on prosodic word. The results show that the performance of tonal syllable error rate can be reduced. For the non-standard Chinese Mandarin speech, the correlation between computer score and expert score was improved above 3.0% absolutely, compared with the baseline system without tonal pronunciation test.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116184341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis and Modeling of Affective Audio Visual Speech Based on PAD Emotion Space","authors":"Shen Zhang, Yingjin Xu, Jia Jia, Lianhong Cai","doi":"10.1109/CHINSL.2008.ECP.82","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.82","url":null,"abstract":"This paper analyzes acoustic and visual features for affective audio-visual speech based on PAD (Pleasure-Arousal-Dominance) emotion space. The selected acoustic features include F0 maximum, F0 minimum, duration and energy. A set of Partial Expression Parameters (PEP) is proposed as visual features to describe affective facial movement on talking face. This paper explores the connection between PAD emotion space and acoustic/visual features respectively. The variation of acoustic features is predicted by PAD values, and a PAD-PEP mapping function for facial expression synthesis is built. Experimental result shows that PAD could be properly applied in describing emotional state as well as predicting the acoustic/visual features for affective audiovisual speech synthesis.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"340 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115884653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Assessment of Language Proficiency through Shadowing","authors":"Dean Luo, N. Minematsu, Yutaka Yamauchi, K. Hirose","doi":"10.1109/CHINSL.2008.ECP.22","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.22","url":null,"abstract":"Shadowing is a practice that requires learners to shadow a presented native utterance as closely and quickly as possible. Learners' pronunciation in shadowing, especially in the case of beginners, often becomes inarticulate and corrupt. These features of shadowing make it very difficult to assess shadowing productions. In this paper, we investigate the automatic pronunciation scoring methods for shadowing. Three automatic scores have been proposed and compared with each other. Experiments show that good correlations are found between the automatic scores and human ratings or TOEIC overall proficiency scores.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131281591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}