{"title":"An Investigation of Phonological Feature Systems Used in Detection-Based ASR","authors":"I-Fan Chen, H. Wang","doi":"10.1109/CHINSL.2008.ECP.38","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.38","url":null,"abstract":"In this paper, we study the effect of using different phonological feature sets for detection-based automatic speech recognition in phone recognition tasks. Three phonological feature sets derived from different underlying phonological theories are investigated. Our experiments were conducted on the TIMIT database. By comparing the oracle phone recognition results achieved by assuming that all the phonological features are correctly detected based on each feature set, we show that selecting an appropriate phonological feature set is crucial to the performance of detection-based ASR. The highly accurate oracle phone recognition results show that the performance of the CRF-based backend, which is commonly used in detection-based ASR, is very satisfactory. Comparison of the oracle phone recognition results and the real phone recognition results indicates that investigation of high-accuracy front-end detectors is a key issue in improving the performance of detection-based ASR.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"32-33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123635408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prosody Modification on Mixed-Language Speech Synthesis","authors":"Yi Zhang, J. Tao","doi":"10.1109/CHINSL.2008.ECP.75","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.75","url":null,"abstract":"This paper proposes a method to generate natural prosody parameters in a Chinese-English mixed-language speech synthesis system built on separate Chinese and English corpora and a small bilingual corpus. Prosodic assimilation of English words to Chinese contexts can be observed in the bilingual corpus. The most obvious assimilation characteristics are a wider pitch range and a longer duration. Based on this observation, a prosody modification model is proposed to adapt monolingual prosody parameters to the mixed-language environment. Experiments show that more natural mixed-language prosody can be generated with our model.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122644675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved Linear Discriminant Analysis Considering Empirical Pairwise Classification Error Rates","authors":"Hung-Shin Lee, Berlin Chen","doi":"10.1109/CHINSL.2008.ECP.49","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.49","url":null,"abstract":"Linear discriminant analysis (LDA) is designed to seek a linear transformation that projects a data set into a lower-dimensional feature space for maximum class geometrical separability. LDA cannot always guarantee better classification accuracy, since its formulation does not take into account the properties of the classifier, such as the automatic speech recognizer (ASR). In this paper, the relationship between the empirical classification error rates and the Mahalanobis distances of the respective class pairs of speech features is investigated, and based on this, a novel reformulation of the LDA criterion, distance-error coupled LDA (DE-LDA), is proposed. One notable characteristic of DE-LDA is that it can modulate the contribution of each class pair to the between-class scatter through an empirical error function, while preserving the lightweight solvability of LDA. Experimental results suggest that DE-LDA yields moderate improvements over LDA on the LVCSR task.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"366 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122851291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Noise Reduction Based Random Matrix Theory","authors":"Xugang Lu, Shigeki Matsuda, Tohru Shimizu, Satoshi Nakamura","doi":"10.1109/CHINSL.2008.ECP.83","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.83","url":null,"abstract":"In the speech enhancement literature, the signal subspace based method has gained considerable attention because of its simple analytical formulation. The method rests on the assumption that the clean speech signal occupies a low-dimensional subspace, while the noise, assumed to be white and additive, spreads over the whole observation space. The method requires an accurate estimate of the noise power (or variance). In real applications, however, the noise power can only be estimated with some degree of uncertainty. Because signal subspace based speech enhancement algorithms do not take this uncertainty into consideration, their performance degrades, especially under heavy noise. In this study, we took the uncertainty of the noise power estimate into consideration by using the statistical properties of noise derived from random matrix theory. The statistical property of the noise (its eigenvalue distribution) was analytically formulated in terms of the maximum and minimum eigenvalues of the noise random matrix. Based on this eigenvalue distribution, we removed the noise contribution from the covariance matrix of the noisy speech. We tested our method for speech enhancement on the AURORA-2J speech corpus. Our initial experiments showed that the proposed method performed better than the traditional signal subspace based speech enhancement method.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124583453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling and Generating Tone Contour with Phrase Intonation for Mandarin Chinese Speech","authors":"Zhizheng Wu, Yao Qian, F. Soong, Bo Zhang","doi":"10.1109/CHINSL.2008.ECP.42","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.42","url":null,"abstract":"This paper models F0 curves with discrete cosine transform (DCT) representations of both syllable-level tone and phrase-level intonation for Mandarin Chinese speech. Decision trees, grown with the maximum likelihood (ML) criterion and stopped with the minimum description length (MDL) criterion, are used to cluster very rich context-dependent DCT models into generalized ones so that unseen contexts in the test data can be predicted robustly. Additionally, we propose to generate Mandarin tone contours by jointly optimizing the syllable and phrase F0 contours in the ML sense. Experimental results on a speaker-dependent continuous speech corpus and a speaker-independent isolated speech corpus show that the proposed approach generates F0 contours with correlation coefficients of 0.92 and 0.82, respectively, between the original and generated F0.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128884418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prosody Study with Context-Dependent Acoustic Models","authors":"Yue-Ning Hu, Min Chu","doi":"10.1109/CHINSL.2008.ECP.26","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.26","url":null,"abstract":"In this paper, we propose to study prosody with context-dependent acoustic models (CDMs). We find that better resolution on a specific aspect can be achieved by training CDMs with a particular focus. For the tone recognition task, CDMs focused on tones should be used; they achieve a 15.2% relative error reduction compared with traditional triphone models. For detecting prosodic boundaries, CDMs focused on position should be used, and the prosodic word detection accuracy is 92.2%. CDMs are also used to visualize the F0 patterns of sentences with given contextual information. Such patterns help in understanding the interaction among contextual factors. Overall, CDMs are a useful data source for various prosody studies.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122320789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multipitch Detection Based on Weighted Summary Correlogram","authors":"Xueliang Zhang, Wenju Liu, Peng Li, Bo Xu","doi":"10.1109/CHINSL.2008.ECP.102","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.102","url":null,"abstract":"In this paper, we introduce a multipitch detection algorithm based on a weighted summary correlogram. The weight is described as a conditional probability that models the relationship between the fundamental frequency (F0) of a periodic sound and the response frequencies of its dominant channels. Modified by this weight, the summary autocorrelation function (SACF) becomes more robust to noise and to subharmonic errors. The proposed algorithm can track single or multiple pitches in noisy environments. Its performance is evaluated on 100 mixtures formed from 10 voiced utterances and 10 different kinds of noise. The results show that our model performs better than existing algorithms.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127375755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Phone Recognizer based MLLR Speaker Recognition","authors":"Eryu Wang, Wu Guo, Lirong Dai","doi":"10.1109/CHINSL.2008.ECP.91","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.91","url":null,"abstract":"The method that uses maximum-likelihood linear regression (MLLR) adaptation transforms as features for a support vector machine (SVM) has been adopted in recent NIST Speaker Recognition Evaluations (SRE). It is attractive because it makes use of high-level information about the speakers and can complement the standard GMM-UBM system. However, the performance of such a system is affected by the phone recognizer, especially in multilingual contexts. In this paper, we use an MLLR-SVM system based on multi-language phone recognizers, which addresses this phone recognizer problem. The system is called parallel phone recognizer MLLR (PPR-MLLR). It has a simpler framework than existing MLLR methods and achieves better performance. On the NIST SRE 2006 1conv4w-1conv4w task, the system achieves an EER of 5.44%. Furthermore, when combined with the cepstral GMM-UBM system, it achieves an EER of 4.20%, almost a 20% improvement in system performance.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116555862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speaker Recognition using a Kind of Novel Phonotactic Information","authors":"Xiang Zhang, Xiang Xiao, Haipeng Wang, Hongbin Suo, Qingwei Zhao, Yonghong Yan","doi":"10.1109/CHINSL.2008.ECP.94","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.94","url":null,"abstract":"In this paper, we present a new modeling approach for speaker recognition that uses a novel kind of phonotactic information as the feature for SVM modeling. Gaussian mixture models (GMMs) have proven extremely successful for text-independent speaker recognition. The GMM universal background model (UBM) is a speaker-independent model, each component of which can be considered to model some underlying phonetic sound. Thus, the UBM can be regarded as characterizing a speaker-independent voice. We assume that utterances from different speakers should yield different average posterior probabilities on the same Gaussian component of the UBM, and that the supervector composed of the average posterior probabilities on all UBM components for each utterance should therefore be discriminative. We use these supervectors as the features for SVM-based speaker recognition. Experimental results show that the proposed approach achieves performance comparable to state-of-the-art systems on the NIST 2006 SRE corpus. Fusion results are also presented.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116645878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An EMD Based Approach to Transliteration Unit Alignment between English and Chinese","authors":"Muyun Yang, Shujie Liu, Lei Wang, Sheng Li, Jufeng Li, T. Zhao, Haoliang Qi","doi":"10.1109/CHINSL.2008.ECP.81","DOIUrl":"https://doi.org/10.1109/CHINSL.2008.ECP.81","url":null,"abstract":"Machine transliteration translates proper nouns from the source language into the target language according to their pronunciation. Recent orthography-based approaches have significantly improved the performance of machine translation. Focusing on transliteration unit alignment, which provides a fundamental parameter for the model, this paper adopts a semi-supervised EMD approach that applies a discriminative model over the original EM results. The experimental results show that the approach substantially improves transliteration performance.","PeriodicalId":291958,"journal":{"name":"2008 6th International Symposium on Chinese Spoken Language Processing","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134176379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}