{"title":"Chinese large-vocabulary name recognition system using character description and syllable spelling recognition","authors":"N. J. Wang, Ching-Ho Tsai, Patrick Huang, Jia-Lin Shen","doi":"10.1109/CHINSL.2004.1409575","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409575","url":null,"abstract":"Large-vocabulary name recognition is one of the challenging tasks in the application of Chinese speech recognition technology. It can be applied to long-list automatic attendant systems and automatic directory assistance systems. A Chinese name usually has two to three characters, with each character pronounced as a single syllable. It is a high-perplexity task to recognize a word from a long list of candidates, such as the more than three hundred thousand unique names in our experiments, given a very short utterance of one to two seconds of speech. Two novel approaches under an interactive framework are proposed in this paper to aid the recognition of a Chinese name: character description recognition (CDR) and syllable spelling recognition (SSR). Together with our robust finite-state recognizer given a graph-structured syllable lexicon for the full names, we achieved a very promising name recognition success rate, 94.5%, in our system-initiative dialogue system.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122409966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High quality harmonic excitation linear predictive speech coding at 2 kb/s","authors":"C. Bao, J. Lukasiak, C. Ritz","doi":"10.1109/CHINSL.2004.1409608","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409608","url":null,"abstract":"The paper presents a high quality harmonic excitation linear predictive (HE-LPC) speech coder operating at 2 kb/s based on a harmonic excitation model with two bands. The system incorporates novel features such as: combined pitch detection; residual harmonic matching voicing determination; extraction and interpolation of residual harmonic magnitudes. Subjective listening tests indicate that this coder has the same quality as that of the Federal Standard MELP (mixed excitation linear prediction) coder at 2.4 kb/s, whether the training database is from Chinese or English.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117344238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A system for Mandarin short phrase recognition on portable devices","authors":"Chao Xu, Yi Y. Liu, Yongsheng Yang, Pascale Fung, Z. Cao","doi":"10.1109/CHINSL.2004.1409628","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409628","url":null,"abstract":"With the proliferation of portable devices, speech recognition, especially name, address and command recognition, on these devices is a topic of growing relevance. A Mandarin short phrase recognition system is introduced in consideration of the limited resources and computing capability of portable devices. A fixed-point front-end is developed, a discrete hidden Markov model is employed for acoustic modeling, and an SNR-based likelihood weighting method is proposed to improve the noise robustness of the system. The memory size of the model set is 269 kB, the decoding time is 0.89 times the speech duration, and the robustness method gives a relative 15.2% word error rate reduction in a complex practical environment with both channel distortion and non-stationary noise present.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131092325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An improved 4 kbit/s CELP speech coding algorithm","authors":"Yanning Bai, C. Bao","doi":"10.1109/CHINSL.2004.1409609","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409609","url":null,"abstract":"The paper presents a 4 kbit/s CELP speech coder that utilizes the nonuniform and part-searching-area algebraic codebook technologies to overcome the insufficient number of signed pulses in a fixed codebook (FCB). The nonuniform algebraic codebook is based on the nonuniform statistical properties of the FCB. The part-searching-area utilizes the periodicity of the FCB excitation signal at low bit rates. The latter is only employed when the pitch delay is small enough. We also find that preserving the continuity of pitch is very important for voiced segments if these two technologies are used. So different pitch-detection methods are employed for voiced/unvoiced frames. Subjective and objective test results indicate that the qualities of reconstructed speech are improved, especially for female speakers.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123101765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Grapheme-to-phoneme conversion in Chinese TTS system","authors":"Honghui Dong, J. Tao, Bo Xu","doi":"10.1109/CHINSL.2004.1409612","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409612","url":null,"abstract":"Phonetization is an important component in Chinese TTS systems. However, the polyphonic characters make this problem more complex. The paper reports a study on the relation between Chinese characters and their pronunciation, proposes the solution to the disambiguation of polyphonic characters, a dictionary-based method, and a rules-based method. In the rules-based method, we use the statistical decision list method. The phonetization plan has been proved effective experimentally. Improvements in the accuracy of polyphone phonetization are mostly over 10%.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129629940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On analysis of eigenpitch in Mandarin Chinese","authors":"Jilei Tian, J. Nurminen","doi":"10.1109/CHINSL.2004.1409593","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409593","url":null,"abstract":"Prosody is an inherent supra-segmental feature of human speech that is being employed to express, e.g., attitude, emotion, intent and attention. Pitch is the most important feature among the prosodic information. For Mandarin Chinese speech, the pitch information is even more crucial because Mandarin is a tonal language in which the tone of each syllable is described by its pitch contour. In this paper, the concept of syllable-based eigenpitch is introduced and investigated using principal component analysis (PCA). The eigenpitch and the related eigenfeatures are analyzed, and it is shown that the tonal patterns are preserved in the eigenpitch representation. Furthermore, we show that the dimension of pitch in the eigenspace can be reduced while minimizing the energy loss of the original pitch contour. Finally, we briefly discuss the quantization properties of the eigenpitch representation. We also present experimental results obtained using a Mandarin speech database. They are in line with the theoretical reasoning and further prove the usefulness of the proposed pitch modeling technique.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125585562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The disambiguation strategies of semantic analysis in Chinese spoken dialogue system","authors":"Bei Liu, Limin Du","doi":"10.1109/CHINSL.2004.1409618","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409618","url":null,"abstract":"Semantic frame analysis is one of the most commonly used semantic analysis methods in Chinese spoken dialogue system research. The two typical ambiguous structures commonly encountered in semantic analysis are relation-ambiguity and structural-ambiguity. According to the features of these two ambiguous structures, this paper puts forth the semantic PCFG (probabilistic context free grammar) model based disambiguation strategy to solve structural-ambiguity, and the expectation model (EM) based disambiguation strategy to solve relation-ambiguity. Efficient algorithms of the two methods are also provided. The experimental results show that applying these two disambiguation strategies can greatly improve the performance of language understanding in a base-line system. In particular, sentence accuracy is improved from 75.7% to 91.5%, and the three targets of semantic unit understanding rate (correction, recall, and precision) are also improved by 10% on average.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125210804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A superposed prosodic model for Chinese text-to-speech synthesis","authors":"G. Chen, G. Bailly, Qingfeng Liu, Ren-Hua Wang","doi":"10.1109/CHINSL.2004.1409615","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409615","url":null,"abstract":"The paper presents the application of the trainable SFC superpositional prosodic model to Chinese. Within the SFC model, prosodic parameters (F0, syllabic lengthening) are interpreted as the superposition of overlapping multiparametric contours. These contours are associated with high-level prosodic features operating at different scopes, such as tones, stress, prosodic boundary, part of speech of words, etc. Each feature label corresponds to a metalinguistic function (morphological, lexical, syntactic, attitudinal, etc.) which is represented by a neural network. The observed contour is the sum of the outputs of the corresponding neural networks. An analysis-by-synthesis scheme is implemented for automatic learning. This model works well in the concatenation of neighbored units. The RMSE of F0 prediction is 2.34 st (referenced to 200 Hz), correlation is 0.86. Perceptual experiments show that the predicted prosody is quite appropriate and fluent.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129486777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrating tonal information into Mandarin name recognition with different strategies","authors":"Dongsheng Luo, Xiang Xie, Jingming Kuang","doi":"10.1109/CHINSL.2004.1409637","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409637","url":null,"abstract":"Name recognition is a practical application of speech recognition technology. As Chinese is well known to be a tonal language, tonal information has an important influence on this task. In this paper we integrate tonal information into a speaker-independent Mandarin name recognizer, and two combination strategies, feature combination and posterior combination, are investigated first. The recognizer is evaluated on an extremely challenging Mandarin name corpus, which includes 100 tonally confusing pairs. Although a significant improvement in the recognition accuracy can be achieved with either strategy, the system has poor flexibility. Based on the analysis of the experimental results we propose a two-step process to improve the system performance further. It is shown that a maximal improvement of 29.96% in word accuracy can be achieved. At the same time the system has good flexibility with tonal information being integrated dynamically.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"12 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131433760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An initial prototype system for Chinese spoken document understanding and organization for indexing/browsing and retrieval applications","authors":"Lin-Shan Lee, Shun-Chuan Chen, Yuan Ho, Jia-fu Chen, Ming Li, T. Li","doi":"10.1109/CHINSL.2004.1409653","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409653","url":null,"abstract":"The most attractive form of future network content will be multimedia. When voice information is included, it usually carries core concepts for the content. Thus, a spoken document associated with multimedia content can very possibly serve as the key for indexing/browsing and retrieval. However, unlike written documents, multimedia or voice information is very often just audio/video signals. They are very difficult to index, browse or retrieve, since users cannot go through each of them from the beginning to the end during browsing. A possible approach may be to segment the audio/video signals automatically into short paragraphs, each with a central concept or topic, and then automatically generate a title and/or a summary for each of these, in either speech or text form. The topics and central concepts described in the segmented short paragraphs may then be further analyzed and organized into graphic structures describing the relationships among these topics and central concepts. Hence, the multimedia content can be automatically indexed much more efficiently and browsed and retrieved by the user based on the title, summary and graphic structure. We refer to this as the understanding and organization of spoken documents. An initial prototype system for such functions, with broadcast news taken as the example multimedia content, is presented. The graphic structures used to describe the relationships among the topics and central concepts are 2-dimensional tree structures developed based on probabilistic latent semantic analysis.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115905254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}