{"title":"A study on cepstral sub-band normalization for robust ASR","authors":"Syu-Siang Wang, J. Hung, Yu Tsao","doi":"10.1109/ISCSLP.2012.6423484","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423484","url":null,"abstract":"In this paper, we propose a cepstral subband normalization (CSN) approach for robust speech recognition. The CSN approach first applies the discrete wavelet transform (DWT) to decompose the original cepstral feature sequence into low and high frequency band (LFB and HFB) parts. Then, CSN normalizes the LFB components and zeros out the HFB components. Finally, an inverse DWT is applied to the LFB and HFB components to form the normalized cepstral features. When the Haar functions are used as the DWT bases, CSN can be computed efficiently, with a 50% reduction in the number of feature components. In addition, our experimental results on the Aurora-2 task show that CSN outperforms the conventional cepstral mean subtraction (CMS), cepstral mean and variance normalization (CMVN), and histogram equalization (HEQ). We also integrate CSN with advanced frontend (AFE) for feature extraction. Experimental results indicate that the integrated AFE+CSN achieves notable improvements over the original AFE. Its simple calculation, compact form, and effective noise robustness make CSN well suited to mobile applications.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125167434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Statistical modification based post-filtering technique for HMM-based speech synthesis","authors":"Zhengqi Wen, J. Tao, Hao Che","doi":"10.1109/ISCSLP.2012.6423456","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423456","url":null,"abstract":"Speech generated by hidden Markov model (HMM)-based speech synthesis systems (HTS) suffers from an over-smoothing problem caused by statistical modeling. This paper focuses on a post-filtering technique based on statistical modification of the generated speech parameters. The marginal statistics of the parameter trajectories, such as mean, variance, skewness and kurtosis, are adjusted according to the values generated from the HTS system. This technique is compared with the global variance (GV)-based speech generation algorithm. The listening test showed that the post-filtering technique considering the mean and variance produced results almost equal to those of the GV model. When the skewness and kurtosis were also modified, the quality of the generated speech improved.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123733291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speaker-ensemble hidden Markov modeling for automatic speech recognition","authors":"Guoli Ye, B. Mak","doi":"10.1109/ISCSLP.2012.6423532","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423532","url":null,"abstract":"This paper proposes a new hidden Markov model (HMM) which we call speaker-ensemble HMM (SE-HMM). An SE-HMM is a multi-path HMM in which each path is an HMM constructed from the training data of a different speaker. SE-HMM may be considered a form of template-based acoustic model where speaker-specific acoustic templates are compressed statistically into speaker-specific HMMs. However, one has the flexibility of building SE-HMM at various levels of compression: SE-HMM may be built for a triphone state, a triphone, a whole utterance, or other convenient phonetic units. As a result, SE-HMM contains more details than conventional HMM, but is much smaller than common template-based acoustic models. Furthermore, the construction of SE-HMM is simple, and since it is still an HMM, its construction and computation are well supported by common HMM toolkits such as HTK. The proposed SE-HMM was evaluated on Resource Management and Wall Street Journal tasks, and it consistently gives better word recognition results than conventional HMM.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126236564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Controlling the tradeoff property in a regularization framework for noise reduction","authors":"Xugang Lu, M. Unoki, Shigeki Matsuda, Chiori Hori, H. Kashioka","doi":"10.1109/ISCSLP.2012.6423500","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423500","url":null,"abstract":"The tradeoff between noise reduction and speech distortion is a key concern in designing noise reduction algorithms. We have proposed a regularization framework for noise reduction that takes this tradeoff into account. We regard speech estimation as a functional approximation problem in a reproducing kernel Hilbert space (RKHS). In the estimation, the objective function is formulated to find an approximation function that gives a good tradeoff between the approximation accuracy and the complexity of the function. By using a regularization method, the approximation function can be estimated from noisy observations. In this paper, we further provide a theoretical analysis of the tradeoff property of the framework in noise reduction. We applied the framework to speech enhancement experiments in real applications. Compared with several classical noise reduction methods, the proposed framework showed promising advantages.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122643745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A cross-dialect comparison of vowel dispersion and vowel variability","authors":"Wai-Sum Lee","doi":"10.1109/ISCSLP.2012.6423458","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423458","url":null,"abstract":"The study is a cross-dialect comparison of the vowel systems of different inventories across five Chinese dialects in terms of vowel dispersion and vowel variability. The dialects include Meixian Kejia or Hakka with 5 vowels, Hong Kong Cantonese with 7 vowels, Fuzhou with 8 vowels, Ningbo with 10 vowels, and Wenling with 11 vowels. Formant frequencies were obtained through spectral analysis of speech data from 10 male and 10 female speakers of each dialect. The findings of this study do not support the vowel dispersion theory, which predicts that (i) the larger the vowel inventory is, the more expanded the vowel space will be in the F1F2 plane, and (ii) variability in vowel formants is inversely related to vowel inventory size.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130145297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effects of excitation spread on the intelligibility of Mandarin speech in cochlear implant simulations","authors":"Fei Chen, Tian Guan, L. Wong","doi":"10.1109/ISCSLP.2012.6423502","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423502","url":null,"abstract":"Noisy listening conditions remain challenging for most cochlear implant patients. The present study simulated the effects of decay rates of excitation spread in cochlear implants on the intelligibility of Mandarin speech in noise. Mandarin sentence and tone stimuli were processed by a noise vocoder and presented to normal-hearing listeners for identification. The decay rates of excitation spread were simulated by varying the slopes of the synthesis filters in the noise vocoder. Experimental results showed a significant benefit for Mandarin sentence recognition in noise with a narrower type of excitation. The performance of Mandarin tone identification was relatively robust to the influence of excitation spread. The results in the present study suggest that reducing the decay rates of excitation spread may potentially improve the speech perception in noise for cochlear implants in the future.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"173 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113996616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tones in whispered Mandarin","authors":"Bin Li, R. Rong","doi":"10.1109/ISCSLP.2012.6423539","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423539","url":null,"abstract":"This paper examines and compares the characteristics of tones in a CV syllable in Mandarin under phonated and whispered speech. Formants of the vowel in various contexts are also compared across the tone environments in different phonation types, in order to assess whether and how tone environments and vowel production interact. The paper is also interested in whether the lack of fundamental frequency in whisper is compensated for by other phonetic means in a tonal language. Results suggest that temporal correlates are maintained to a certain extent, and that the vowel space is shifted significantly towards a higher frequency range.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131216090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A unified trajectory tiling approach to high quality TTS and cross-lingual voice transformation","authors":"Yao Qian, F. Soong","doi":"10.1109/ISCSLP.2012.6423506","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423506","url":null,"abstract":"In human-machine speech communication, it is technically challenging to make the machine talk as naturally as a human so as to facilitate “frictionless” interactions, or to make a human user feel that the communication is as natural as human-human communication. We propose a trajectory tiling approach to high quality speech synthesis, where the speech parameter trajectories, extracted from natural, processed, or synthesized speech, are used to guide the search for the best sequence of waveform segment “tiles” stored in a pre-recorded speech database. We test our approach in both TTS and cross-lingual voice transformation applications. Experimental results show that the proposed trajectory tiling approach can render speech which is both natural and highly intelligible. The perceived high quality of the speech is also confirmed in objective and subjective tests.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114422562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved unit selection speech synthesis method utilizing subjective evaluation results on synthetic speech","authors":"Xian-Jun Xia, Zhenhua Ling, Chen-Yu Yang, Lirong Dai","doi":"10.1109/ISCSLP.2012.6423524","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423524","url":null,"abstract":"This paper presents an improved unit selection and waveform concatenation speech synthesis method that gathers and utilizes human feedback on synthetic speech. First, a set of texts is synthesized by the baseline unit selection synthesis system. Each prosodic word within the synthetic speech is then evaluated as natural or unnatural by listeners. In our proposed method, these natural synthetic segments are treated as virtual candidate units to extend the original speech corpus for unit selection. A new speech synthesis system is constructed using this extended speech corpus. A synthetic error detector based on an SVM classifier is also built using the natural and unnatural synthetic speech. At synthesis time, the input text is synthesized using the baseline system and the extended system simultaneously. The two unit selection results are evaluated by the trained synthetic error detector to determine the optimal one. Experimental results demonstrate the effectiveness of our proposed method in improving the naturalness of synthetic speech on a task of synthesizing place names.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130753248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Alternative hypothesis generation using a weighted kernel feature matrix for ASR substitution error correction","authors":"Chao-Hong Liu, Chung-Hsien Wu, David Sarwono","doi":"10.1109/ISCSLP.2012.6423475","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423475","url":null,"abstract":"Although automatic speech recognition (ASR) has been successfully used in several applications, it is still non-robust and imprecise, especially in harsh environments where the input speech is of low quality. Robust error correction for ASR outputs thus becomes important in addition to improving recognition performance. In recent approaches to error correction, linguistic or domain information is used to generate the alternative hypotheses for the ASR outputs, followed by the selection of the most likely alternative. In this study, the distances between ASR outputs and the potentially correct alternatives are estimated based on a weighted context-dependent syllable cluster-based kernel feature matrix followed by multidimensional scaling (MDS)-based distance rescaling. These distances are then used to construct an alternative syllable lattice, and dynamic programming is used to obtain the most likely correct output with respect to the original ASR results. Experiments show that the proposed method achieved about a 1.95% improvement in word error rate compared to the correction pair approach on the MATBN Mandarin Chinese broadcast news corpus.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122580341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}