{"title":"Large vocabulary continuous Mandarin speech recognition using finite state machine","authors":"Yi-Cheng Pan, Chia-Hsing Yu, Lin-Shan Lee","doi":"10.1109/CHINSL.2004.1409572","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409572","url":null,"abstract":"The finite state transducer (FST), popularly used in the natural language processing (NLP) area to represent the grammar rules and the characteristics of a language, has been extensively used as the core in large vocabulary continuous speech recognition (LVCSR) in recent years. By means of FST, we can effectively compose the acoustic model, pronunciation lexicon, and language model to form a compact search space. In this paper, we present our approach to developing an LVCSR decoder using FST as the core. In addition, the traditional one-pass tree-copy search algorithm is also described for comparison in terms of speed, memory requirements and achieved character accuracy.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"331 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134100626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of Shanghainese F0 contours based on the command-response model","authors":"Wentao Gu, K. Hirose, H. Fujisaki","doi":"10.1109/CHINSL.2004.1409591","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409591","url":null,"abstract":"As one of the major Chinese dialects, Shanghainese is well known for its complex tone sandhi system. This paper applies the command-response model to represent F0 contours of Shanghainese speech. Analysis-by-synthesis is conducted both on carrier sentences with monosyllabic target words and on isolated polysyllabic words, from which a set of appropriate tone command patterns is derived for words of different lengths and different initial citation tones. By incorporating the effects of tone coarticulation, word accentuation and phrase intonation, the model gives high accuracy of approximations to F0 contours of Shanghainese utterances, and hence provides a more efficient means to quantitatively represent F0 contours and to describe the tone sandhi system of Shanghainese than the traditional 5-level tone code system.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"230 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134223899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hearer model based stress prediction for Chinese TTS system","authors":"Guoping Hu, Qingfeng Liu, Yu Hu, Ren-Hua Wang","doi":"10.1109/CHINSL.2004.1409611","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409611","url":null,"abstract":"People often feel tired if they listen to synthesized speech for a long time. This is mainly because synthesized speech is too flat and never stresses the focus. Unlike traditional TTS research approaches of speaker simulation, this paper investigates stress prediction from the point of view of the hearer. An ideal hearer model is first proposed to predict the stress distribution based on the following hypothesis: people speak with limited stress effort and distribute that limited effort to ensure that the hearer can understand the speaker easily. Then, given the limited research resources available, we modify the ideal hearer model and present a practical model. Experiments show that the stress prediction achieves an acceptable accuracy of 87.36%.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134469165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Emotion recognition from Mandarin speech signals","authors":"T. Pao, Yu-Te Chen, Jun-Heng Yeh","doi":"10.1109/CHINSL.2004.1409646","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409646","url":null,"abstract":"In this paper, a Mandarin speech based emotion classification method is presented. Five primary human emotions including anger, boredom, happiness, neutral and sadness are investigated. In emotion classification of speech signals, the conventional features are statistics of fundamental frequency, loudness, duration and voice quality. However, the recognition accuracy of systems employing these features degrades substantially when more than two valence emotion categories are invoked. For speech emotion recognition, we select 16 LPC coefficients, 12 LPCC components, 16 LFPC components, 16 PLP coefficients, 20 MFCC components and jitter as the basic features to form the feature vector. A Mandarin corpus recorded by 12 non-professional speakers is employed. The recognizer presented in this paper is based on three recognition techniques: LDA, K-NN, and HMMs. Experimental results show that the selected features are robust and effective for emotion recognition, not only in the arousal dimension but also in the valence dimension.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116739197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An investigation into subspace rapid speaker adaptation","authors":"Michael Zhang, Jun Xu","doi":"10.1109/CHINSL.2004.1409639","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409639","url":null,"abstract":"Speaker adaptation is an essential part of any state-of-the-art automatic speech recognizer (ASR). Recently, more and more application requirements have appeared for embedded ASR. For these cases, a more compact speech model, the subspace distribution clustering hidden Markov model (SDCHMM), is used instead of the continuous density hidden Markov model (CDHMM). In previous studies on SDCHMM adaptation, the subspace Gaussian pools of the SDCHMM are the parameters adjusted for speaker variations. Alternatively, we try to employ the link table parameters of the SDCHMM, which define the tying structure in subspaces, to model the inter-speaker mismatch, with the Gaussian parameters maintained. Since the variation range for these parameters is highly limited, this method is potentially faster than conventional Gaussian pool adaptation. A comparative study on a continuous digital dialing (CDD) task shows that when data is seriously insufficient, link table adaptation is more effective than conventional methods, with 17% relative improvement in utterance accuracy rate, compared to 14% improvement by previous Gaussian adaptation. However, further improvement with more data is limited. When the data size is doubled, this method gives 21% improvement, compared to 30% improvement by the conventional method.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133254460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An information gain and grammar complexity based approach to attribute selection in speech enabled information retrieval dialogs","authors":"Haiping Li, Haixin Chai","doi":"10.1109/CHINSL.2004.1409657","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409657","url":null,"abstract":"An effective dialog driven method is required for today's speech enabled information retrieval systems, such as name dialers. Similar to the dynamic sales dialog for electronic commerce scenarios, information gain measure based approaches are widely used for attribute selection and dialog length reduction. However, for speech enabled information retrieval systems, another important factor influencing attribute selection is speech recognition accuracy. Accuracy that is too low results in a failed dialog. Recognition accuracy varies with many issues, including acoustic model performance and grammar complexity. The acoustic model is fixed for a whole dialog, while the grammar is different for each interaction round, so grammar complexity influences which attribute is selected for the next question. An approach combining both information gain measurement and grammar complexity is presented for a dynamic dialog driven system. Offline evaluations show that this approach offers a trade-off between faster discrimination of the retrieval candidates and higher recognition accuracy.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125559704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dependence of correct pronunciation of Chinese aspirated sounds on power during voice onset time","authors":"A. Hoshino, Akio Yasuda","doi":"10.1109/CHINSL.2004.1409601","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409601","url":null,"abstract":"The length of voice onset time (VOT) in uttering Chinese aspirated sounds, which are difficult for Japanese to pronounce, is an important factor in evaluating the quality of pronunciation. In this paper, both the length of the VOT and the power used during the VOT for 21 single-vowel syllables of six different Chinese aspirates were measured for 40 Japanese students and nine native speakers of Chinese. The quality of the students' pronunciation was evaluated using a hearing test judged by eight native Chinese. The results indicated that the correlation between the quality of the students' pronunciation and the power used in uttering a sound was stronger than that with the VOT length, within a certain range of VOT that varied for different syllables. Thus, we conclude that power is also an important factor in evaluating the quality of pronunciation.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129111723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy contour enhancement for noisy speech recognition","authors":"Tai-Hwei Hwang, Sen-Chia Chang","doi":"10.1109/CHINSL.2004.1409633","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409633","url":null,"abstract":"Environmental noise, known as an additive noise, not only corrupts the spectra of a speech signal but also blurs the shape of its energy contour. The corruption of the energy contour can distort the energy derived feature and degrade the pattern classification performance on noisy speech. To reduce the distortion of the energy feature, the energy bias in the energy contour has to be removed before feature extraction. For this purpose, we propose two methods to estimate the noise energy: one estimates it from the speech-inactive period, and the other from the noisy speech itself. The methods are evaluated by the connected digit recognition of TIDigits, in which the test speech is corrupted with white noise, babble, factory noise, and in-car noise. As shown in the experiments, the energy enhancement can provide an additional improvement when applied jointly with spectral subtraction.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129937891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Chinese-English mixed-lingual keyword spotting","authors":"Shan-Ruei You, Shih-Chieh Chien, Chih-Hsing Hsu, Ke-Shiu Chen, Jia-Jang Tu, Jeng-Shien Lin, Sen-Chia Chang","doi":"10.1109/CHINSL.2004.1409630","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409630","url":null,"abstract":"Based on our previous experience in the \"ITRI 104 Auto Attendant System\" of using keyword spotting for Mandarin speech recognition (W.-C. Shieh et al., CCL Technical Journal, vol. 96), a Chinese-English mixed-lingual keyword spotting system, which caters for the Taiwanese speaking style, is presented. Detailed descriptions and discussions for developing the mixed-lingual auto attendant system are included, especially for reconciling the different scoring scales of the two languages in the decoding phase and the re-scoring phase. In the decoding phase, we propose a bias-compensation method to bridge the score gap in the likelihood calculation when using Chinese and English acoustic models. To select the most probable result from the recognized hypotheses, a method is also presented for normalizing the combined scores when different scoring mechanisms are used in the re-scoring phase.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"8 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117047038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MCE-based training of subspace distribution clustering HMM","authors":"Xiao-Bing Li, Lirong Dai, Ren-Hua Wang","doi":"10.1109/CHINSL.2004.1409599","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409599","url":null,"abstract":"For resource-limited platforms, the subspace distribution clustering hidden Markov model (SDCHMM) is preferable to the continuous density hidden Markov model (CDHMM) because of its smaller storage and lower computation requirements, while maintaining decent recognition performance. However, the usual method of obtaining an SDCHMM does not ensure optimality in classifier design. In order to obtain an optimal classifier, a new SDCHMM training algorithm that adjusts the parameters of the SDCHMM according to the minimum classification error (MCE) criterion is proposed in this paper. Our experimental results on the TiDigits and RM tasks show that the MCE-based SDCHMM training algorithm provides 15-80% word error rate reduction (WERR) compared with the normal SDCHMM converted from a CDHMM.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121053330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}