{"title":"Progress on Mandarin conversational telephone speech recognition","authors":"M. Hwang, X. Lei, Tim Ng, I. Bulyko, Mari Ostendorf, A. Stolcke, Wen Wang, Jing Zheng, V. R. Gadde, M. Graciarena, M. Siu, Yan Huang","doi":"10.1109/CHINSL.2004.1409571","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409571","url":null,"abstract":"Over the past decade, there has been good progress on English conversational telephone speech (CTS) recognition, built on the Switchboard and Fisher corpora. In this paper, we present our efforts on extending language-independent technologies into Mandarin CTS, as well as addressing language-dependent issues such as tone. We show the impact of each of the following factors: (a) simplified Mandarin phone set; (b) pitch features; (c) auto-retrieved Web texts for augmenting n-gram training; (d) speaker adaptive training; (e) maximum mutual information estimation; (f) decision-tree-based parameter sharing; (g) cross-word co-articulation modeling; and (h) combining MFCC and PLP decoding outputs using confusion networks. We have reduced the Chinese character error rate (CER) of the BBN-2003 development test set from 53.8% to 46.8% after (a)+(b)+(c)+(f)+(g) are combined. Further reduction in CER is anticipated after integrating all improvements.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133803696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trigram duration modeling in speech recognition","authors":"Yun Tang, Wenju Liu, Bo Xu","doi":"10.1109/CHINSL.2004.1409627","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409627","url":null,"abstract":"Rate of speech (ROS) is a very important factor in speech recognition. We present a new speech rate measurement method which first normalizes the duration of different acoustic units to a standard duration and then builds a trigram duration model to measure the speech rate of a sentence. We propose two methods based on the standard duration to compensate for the influence of speech rate variation in a data corpus, and obtain an 11% error rate reduction in Mandarin digit string recognition.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":" 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132075635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Taiwan Mandarin - does it remain homogeneous?","authors":"Hui-ju Hsu","doi":"10.1109/CHINSL.2004.1409603","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409603","url":null,"abstract":"Previous studies have shown discrepancies in tonal realizations between Guoyu and Putonghua. Early studies suggest Guoyu T3 is predominantly a falling tone and recent studies show Guoyu T2 is predominantly a dipping tone, in contrast to the long-considered default dipping and rising tone respectively. This study further explores the existence of regional varieties of Guoyu. Data are collected from Taipei and Taichung. Speakers read target sentences with 19 minimal pairs of T2/T3 syllables placed in sentence-final position. Results indicate regional differences of T2/T3 patterns. The result of Taipei speakers indicates a clear distinction of T2/T3 contour in that T2 is realized as a mid-dipping contour and T3 either a mid-dipping or a mid-falling contour, with the latter as the majority. However, in the Taichung dialect, this distinction disappeared. It is shown that Taichung T2 contour has changed from mid-dipping to mid-falling, merging with T3. This merger is statistically significant.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"417 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122850550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variable-length unit selection using LSA-based syntactic structure cost","authors":"Chung-Hsien Wu, C. Hsia, Jiun-Fu Chen, Te-Hsien Liu","doi":"10.1109/CHINSL.2004.1409621","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409621","url":null,"abstract":"The paper introduces a variable-length unit selection method for concatenative speech synthesis based on a syntactic structure cost derived from latent semantic analysis (LSA). First, a probabilistic context free grammar (PCFG) based parser is used to construct the syntactic structure of the input text sentence. Second, the synthesizer selects the candidate units for each node of the syntactic structure. LSA is then adopted to estimate the syntactic cost between the target unit and the candidate units in the database. Finally, the concatenation of units with minimum cost is selected using a dynamic programming algorithm. Experimental results show that variable-length unit selection based on syntactic structure outperforms the synthesizer that does not consider syntactic structure. Also, the LSA-based syntactic cost provides a better estimation of substitution cost than that calculated only from acoustic features.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122908402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech research in telecommunications: a Bell-centric view","authors":"B. Juang","doi":"10.1109/CHINSL.2004.1409566","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409566","url":null,"abstract":"Summary form only given. Speech research aimed at developing technologies to enhance telecommunications has produced many remarkable results in the past five decades. In various branches of the field, many breakthrough technologies have been brought about due to courageous paradigm shifts advocated by a few. In this paper, we highlight the progress in speech processing technologies, particularly from an historical perspective as seen from Bell Laboratories, and point out these paradigm shifts in the hope of inspiring more technical breakthroughs.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125434974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Focus and intonational phrase boundary in standard Chinese","authors":"Yiya Chen","doi":"10.1109/CHINSL.2004.1409581","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409581","url":null,"abstract":"This paper reports results of an experiment investigating the relation of focus and prosodic boundary. We tested the hypothesis that focus in standard Chinese introduces an intonational phrase (IP) boundary before a focused constituent by examining the durational adjustment of syllables in different prosodic positions (i.e. IP initial versus IP medial) and focus conditions (i.e. focused versus unfocused). Results show that under both focus conditions, IP initial onset was significantly longer than IP medial onset but little difference was observed in rhyme duration. Focus, however, tended to induce lengthening more consistently in rhyme than in onset in both prosodic positions. Furthermore, the magnitudes of lengthening on onset and rhyme tended to be comparable in terms of their percentage of lengthening. This suggests that the effect of prosodic position on segment duration is localized and restricted to onset while the effect of focus is relatively more global and spans over the whole focused constituent. We also found that an IP initial unfocused syllable differed significantly from an IP medial focused syllable in both onset and rhyme duration. We thus conclude that there is no durational evidence that focus inserts an IP boundary to the left edge of a focused constituent in standard Chinese.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129218336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unseen handset mismatch compensation based on feature/model-space a priori knowledge interpolation for robust speaker recognition","authors":"Jyh-Her Yang, Y. Liao","doi":"10.1109/CHINSL.2004.1409587","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409587","url":null,"abstract":"The unseen but mismatched handset is the major source of performance degradation for speaker recognition in the telecommunication environment. In this paper, an unseen handset characteristics estimation method based on a priori knowledge interpolation (AKI) is proposed. AKI could be applied in both the feature and model space to interpolate the feature and model transformation functions measured using stochastic matching (SM) and maximum likelihood linear regression (MLLR), respectively. Cross-validation experimental results on the HTIMIT database showed that the average speaker recognition rate could be improved from 59.6%/57.8% to 73.8%/66.8% for seen/unseen handsets. It is therefore a promising method for robust speaker recognition.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114968094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-driven temporal filters based on maximum mutual information for robust features in speech recognition","authors":"Yung-Sheng Huang, J. Hung","doi":"10.1109/CHINSL.2004.1409597","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409597","url":null,"abstract":"Linear discriminant analysis (LDA), principal component analysis (PCA) and minimum classification error (MCE) have been used to derive data-driven temporal filters in order to improve the robustness of speech features for speech recognition. In this paper, the criterion of maximum mutual information (MMI) is proposed for constructing the temporal filters, and a detailed comparative analysis of these approaches is presented and discussed. Experimental results show that the MMI-derived temporal filters significantly improve the recognition performance of the original mel frequency cepstrum coefficients (MFCC) features compared to LDA/PCA/MCE-derived filters. Also, when the MMI-derived filters are combined with the conventional temporal filters, cepstral mean and variance normalization (CMVN), the recognition performance can be further improved.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121460427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust speaker recognition integrating pitch and Wiener filter","authors":"Junmei Bai, Rong Zheng, Bo Xu, Shuwu Zhang","doi":"10.1109/CHINSL.2004.1409588","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409588","url":null,"abstract":"Speaker recognition (SR) obtains excellent results in clean speech, but noise or channel mismatch causes significant performance degradation in practical applications. The paper focuses on resolving those problems for robust and efficient speaker identification (SI) in noisy environments. It contributes mainly in two areas: signal processing based on Wiener filtering, and speaker feature integration of pitch and mel-frequency cepstrum coefficients (MFCC). Experimental results on the YOHO corpus show that the Wiener filter is an efficient front-end processing technique and pitch is a robust feature for SR in noisy environments.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131872545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Language identification using discriminative weighted language models","authors":"Shizhen Wang, Jia Liu, Runsheng Liu","doi":"10.1109/CHINSL.2004.1409584","DOIUrl":"https://doi.org/10.1109/CHINSL.2004.1409584","url":null,"abstract":"In this paper, discriminative weighted language models are proposed to better distinguish between similar languages. Using a parallel phone recognizers followed by language modeling (PPRLM) system in the first stage, the two best candidates are hypothesized and then processed using discriminative language models. Experimental results show that, compared with the traditional one-pass language identification (LID) systems, the proposed two-pass method can greatly improve the performance without considerably increasing the computational costs. Tested on the evaluation set of the CallFriend corpus, the final system achieved an error rate of 14.90% on the 30s 12-way closed-set task.","PeriodicalId":212562,"journal":{"name":"2004 International Symposium on Chinese Spoken Language Processing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133973358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}