{"title":"Noise-robust whispered speech recognition using a non-audible-murmur microphone with VTS compensation","authors":"Chen-Yu Yang, Georgina Brown, Liang Lu, J. Yamagishi, Simon King","doi":"10.1109/ISCSLP.2012.6423522","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423522","url":null,"abstract":"In this paper, we introduce a newly-created corpus of whispered speech simultaneously recorded via a close-talking microphone and a non-audible murmur (NAM) microphone in both clean and noisy conditions. To benchmark the corpus, which has been freely released recently, experiments on automatic recognition of continuous whispered speech were conducted. When training and test conditions are matched, the NAM microphone is found to be more robust against background noise than the close-talking microphone. In mismatched conditions (noisy data, models trained on clean speech), we found that Vector Taylor Series (VTS) compensation is particularly effective for the NAM signal.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131471966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient feature extraction of speaker identification using phoneme mean F-ratio for Chinese","authors":"Chen Zhao, Hongcui Wang, Songgun Hyon, Jianguo Wei, J. Dang","doi":"10.1109/ISCSLP.2012.6423485","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423485","url":null,"abstract":"The features used for speaker recognition should carry more speaker-specific information while attenuating linguistic information. To discard linguistic information effectively, in this paper we employed the phoneme mean F-ratio method to investigate the different contributions of different frequency regions from the point of view of Chinese phonemes, and applied it to speaker identification. We found that the phoneme-dependent speaker-specific information is distributed across different frequency regions of the speech signal. Based on the contribution rates, we extracted new features and combined them with a GMM model. The speaker identification experiment was conducted on a King-ASR Chinese database. Compared with the MFCC feature, the identification error rate with the proposed feature was reduced by 32.94%. The results confirmed the efficiency of the phoneme mean F-ratio method for improving speaker recognition performance for Chinese.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121059385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new confidence measure combining Hidden Markov Models and Artificial Neural Networks of phonemes for effective keyword spotting","authors":"S. Leow, T. S. Lau, Alvina Goh, Han Meng Peh, Teck Khim Ng, S. Siniscalchi, Chin-Hui Lee","doi":"10.1109/ISCSLP.2012.6423455","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423455","url":null,"abstract":"In this paper, we present an acoustic keyword spotter that operates in two stages, detection and verification. In the detection stage, keywords are detected in the utterances; in the verification stage, confidence measures are used to verify the detected keywords and reject false alarms. A new confidence measure, based on phoneme models trained with an Artificial Neural Network, is used in the verification stage to reduce false alarms. We have found that this ANN-based confidence measure, together with existing HMM-based confidence measures, is very effective in rejecting false alarms. Experiments are performed on two Mandarin databases, and our results show that the proposed method is able to significantly reduce the number of false alarms.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117291743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Perceptual similarity between audio clips and feature selection for its measurement","authors":"Qinghua Wu, Xiao-Lei Zhang, Ping Lv, Ji Wu","doi":"10.1109/ISCSLP.2012.6423476","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423476","url":null,"abstract":"In this paper, we explore the retrieval of perceptually similar audio clips, which focuses on finding sounds according to human perception. Such retrieval is thus more “human-centered” [1] than previous audio retrieval approaches, which aim to find homologous sounds. We make comprehensive use of various acoustic features to measure perceptual similarity. Since some acoustic features may be redundant or even detrimental to the similarity measurement, we propose to find a complementary and effective combination of acoustic features via the SFFS (Sequential Floating Forward Selection) method. Experimental results show that LSP, MFCC, and PLP are the three most effective acoustic features. Moreover, the optimal combination of features can improve the accuracy of similarity classification by about 2% compared with the best performance of any single acoustic feature.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128652647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross validation and Minimum Generation Error for improved model clustering in HMM-based TTS","authors":"Fenglong Xie, Yi-Jian Wu, F. Soong","doi":"10.1109/ISCSLP.2012.6423459","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423459","url":null,"abstract":"In HMM-based speech synthesis, context-dependent hidden Markov models (HMMs) are widely used for their capability to synthesize highly intelligible and fairly smooth speech. However, training HMMs of all possible contexts well is difficult, or even impossible, due to the intrinsic problem of insufficient training data coverage. As a result, the trained models may overfit, and their capability to predict unseen contexts at test time is highly restricted. Recently, cross-validation (CV) has been explored and applied to decision tree-based clustering with the Maximum-Likelihood (ML) criterion, showing improved robustness in TTS synthesis. In this paper, we generalize CV to decision tree clustering with a different criterion, Minimum Generation Error (MGE). Experimental results show that the generalization to MGE yields better TTS synthesis performance than that of the baseline systems.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129084301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TDOA information based VAD for robust speech recognition in directional and diffuse noise field","authors":"Kuan-Lang Huang, T. Chi","doi":"10.1109/ISCSLP.2012.6423514","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423514","url":null,"abstract":"A two-microphone algorithm is proposed to improve automatic speech recognition (ASR) rates when target speech is corrupted by directional interferences and diffuse noise simultaneously. The algorithm adopts the time difference of arrival (TDOA) to suppress directional interferences and a TDOA-information based voice activity detector (VAD) to suppress diffuse noise. Simulation results show the proposed algorithm is effective in improving ASR rates in a sound field mixed with a directional interference and diffuse noise. Compared with the phase difference (PD) algorithm, the proposed method gives comparable recognition rates when facing a directional interference and much higher and more robust recognition rates when diffuse noise emerges.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127040670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power-normalized PLP (PNPLP) feature for robust speech recognition","authors":"Lichun Fan, Dengfeng Ke, Xiaoyin Fu, Shixiang Lu, Bo Xu","doi":"10.1109/ISCSLP.2012.6423529","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423529","url":null,"abstract":"In this paper, we first review several feature extraction algorithms for robust speech recognition, e.g. Mel frequency cepstral coefficients (MFCC) [1], perceptual linear prediction (PLP) [2] and power-normalized cepstral coefficients (PNCC) [3]. We then propose a new feature extraction algorithm for noise-robust speech recognition, in which medium-time processing serves as the noise suppression module. The details of the algorithm are described to show its advantages. The experimental results show that our proposed method significantly outperforms state-of-the-art algorithms.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133125397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Break index labeling of mandarin text via syntactic-to-prosodic tree mapping","authors":"Xiaotian Zhang, Yao Qian, Hai Zhao, F. Soong","doi":"10.1109/ISCSLP.2012.6423468","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423468","url":null,"abstract":"In this study, we investigate the break index labeling problem with a syntactic-to-prosodic structure conversion. The statistical relationship between the mapped syntactic tree structure and prosodic tree structure of sentences in the training set is used to generate a Synchronous Tree Substitution Grammar (STSG), which describes the probabilistic mapping (substitution) rules between them. For a given test sentence and its parsed syntactic tree structure, the generated STSG can statistically convert the syntactic tree into a prosodic tree. We compare the labeling results with other approaches and show that the probabilistic mapping can indeed benefit break index labeling performance.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117354779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discriminant local information distance preserving projection for text-independent speaker recognition","authors":"Liang He, Jia Li","doi":"10.1109/ISCSLP.2012.6423466","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423466","url":null,"abstract":"A novel method based on a statistical manifold is presented for text-independent speaker recognition. After feature extraction, speaker recognition becomes a sequence classification problem. By discarding time information, the core task is the comparison of multiple sample sets. Each set is assumed to be governed by a probability density function (PDF). We estimate the PDFs and place the estimated statistical models on a statistical manifold. Fisher information distance is applied to compute the distance between adjacent PDFs. Discriminant local preserving projection is used to push apart adjacent PDFs that belong to different classes, further improving recognition accuracy. Experiments were carried out on the NIST SRE08 tel-tel database. The proposed method gave excellent performance.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"482 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121160930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparative study of fMPE and RDLT approaches to LVCSR","authors":"Jian Xu, Zhijie Yan, Qiang Huo","doi":"10.1109/ISCSLP.2012.6423511","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423511","url":null,"abstract":"This paper presents a comparative study of two discriminatively trained feature transform approaches, namely feature-space minimum phone error (fMPE) and region-dependent linear transform (RDLT), for large vocabulary continuous speech recognition (LVCSR). Experiments are performed on an LVCSR task of conversational telephone speech transcription using about 2,000 hours of training data. Starting from a maximum likelihood (ML) trained GMM-HMM baseline system, the recognition accuracy and run-time efficiency of different variants of the above two methods are evaluated, and a specific RDLT approach is identified and recommended for deployment in LVCSR applications.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"38 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114033087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}