{"title":"Gain estimation approaches in catalog-based single-channel speech-music separation","authors":"Cemil Demir, A. Cemgil, M. Saraçlar","doi":"10.1109/ASRU.2011.6163928","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163928","url":null,"abstract":"In this study, we analyze the gain estimation problem of the catalog-based single-channel speech-music separation method, which we proposed previously. In the proposed method, assuming that we know a catalog of the background music, we developed a generative model for the superposed speech and music spectrograms. We represent the speech spectrogram by a Non-Negative Matrix Factorization (NMF) model and the music spectrogram by a conditional Poisson Mixture Model (PMM). In this model, we assume that the background music is generated by repeating and changing the gain of the jingle in the music catalog. Although the separation performance of the proposed method is satisfactory with known gain values, the performance decreases when the gain value of the jingle is unknown and has to be estimated. In this paper, we address the gain estimation problem of the catalog-based method and propose three different approaches to overcome this problem. One of these approaches is to use Gamma Markov Chain (GMC) probabilistic structure to impose the correlation between the gain parameters across the time frames. By using GMC, the gain parameter is estimated more accurately. The other approaches are maximum a posteriori (MAP) and piece-wise constant estimation (PCE) of the gain values. Although all three methods improve the separation performance as compared to the original method itself, GMC approach achieved the best performance.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122946618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From Modern Standard Arabic to Levantine ASR: Leveraging GALE for dialects","authors":"H. Soltau, L. Mangu, Fadi Biadsy","doi":"10.1109/ASRU.2011.6163942","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163942","url":null,"abstract":"We report a series of experiments about how we can progress from Modern Standard Arabic (MSA) to Levantine ASR, in the context of the GALE DARPA program. While our GALE models achieved very low error rates, we still see error rates twice as high when decoding dialectal data. In this paper, we make use of a state-of-the-art Arabic dialect recognition system to automatically identify Levantine and MSA subsets in mixed speech of a variety of dialects including MSA. Training separate models on these subsets, we show a significant reduction in word error rate over using the entire data set to train one system for both dialects. During decoding, we use a tree array structure to mix Levantine and MSA models automatically using the posterior probabilities of the dialect classifier as soft weights. This technique allows us to mix these models without sacrificing performance for either varieties. Furthermore, using the initial acoustic-based dialect recognition system's output, we show that we can bootstrap a text-based dialect classifier and use it to identify relevant text data for building Levantine language models. Moreover, we compare different vowelization approaches when transitioning from MSA to Levantine models.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126090793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Utterance verification using garbage words for a hospital appointment system with speech interface","authors":"Mitsuru Takaoka, H. Nishizaki, Y. Sekiguchi","doi":"10.1109/ASRU.2011.6163954","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163954","url":null,"abstract":"On a system that captures spoken dialog, users often use out-of-domain utterances to the system. The speech recognition component in the dialog system cannot correctly recognize such utterances, which causes fatal errors. This paper proposes a method to verify whether utterances are in-domain or out-of-domain. The proposed method trains systems with two language models: one that can accept both in-domain and out-of-domain utterances and the other that can accept only in-domain utterances. These models are installed into two speech recognition systems. A comparison of the recognizers' outputs provides a good verification of utterances. We installed our method in a hospital appointment system and evaluated it. The experimental results showed that the proposed method worked well.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121796495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving reverberant VTS for hands-free robust speech recognition","authors":"Yongqiang Wang, M. Gales","doi":"10.1109/ASRU.2011.6163915","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163915","url":null,"abstract":"Model-based approaches to handling additive background noise and channel distortion, such as Vector Taylor Series (VTS), have been intensively studied and extended in a number of ways. In previous work, VTS has been extended to handle both reverberant and background noise, yielding the Reverberant VTS (RVTS) scheme. In this work, rather than assuming the observation vector is generated by the reverberation of a sequence of background noise corrupted speech vectors, as in RVTS, the observation vector is modelled as a superposition of the background noise and the reverberation of clean speech. This yields a new compensation scheme RVTS Joint (RVTSJ), which allows an easy formulation for joint estimation of both additive and reverberation noise parameters. These two compensation schemes were evaluated and compared on a simulated reverberant noise corrupted AURORA4 task. Both yielded large gains over VTS baseline system, with RVTSJ outperforming the previous RVTS scheme.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131188685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Latent semantic analysis for question classification with neural networks","authors":"B. Loni, Seyedeh Halleh Khoshnevis, P. Wiggers","doi":"10.1109/ASRU.2011.6163971","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163971","url":null,"abstract":"An important component of question answering systems is question classification. The task of question classification is to predict the entity type of the answer of a natural language question. Question classification is typically done using machine learning techniques. Most approaches use features based on word unigrams which leads to large feature space. In this work we applied Latent Semantic Analysis (LSA) technique to reduce the large feature space of questions to a much smaller and efficient feature space. We used two different classifiers: Back-Propagation Neural Networks (BPNN) and Support Vector Machines (SVM). We found that applying LSA on question classification can not only make the question classification more time efficient, but it also improves the classification accuracy by removing the redundant features. Furthermore, we discovered that when the original feature space is compact and efficient, its reduced space performs better than a large feature space with a rich set of features. In addition, we found that in the reduced feature space, BPNN performs better than SVMs which are widely used in question classification. Our result on the well known UIUC dataset is competitive with the state-of-the-art in this field, even though we used much smaller feature spaces.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129837047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detection of persons with Parkinson's disease by acoustic, vocal, and prosodic analysis","authors":"T. Bocklet, E. Nöth, G. Stemmer, Hana Ruzickova, J. Rusz","doi":"10.1109/ASRU.2011.6163978","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163978","url":null,"abstract":"70% to 90% of patients with Parkinson's disease (PD) show an affected voice. Various studies revealed, that voice and prosody is one of the earliest indicators of PD. The issue of this study is to automatically detect whether the speech/voice of a person is affected by PD. We employ acoustic features, prosodic features and features derived from a two-mass model of the vocal folds on different kinds of speech tests: sustained phonations, syllable repetitions, read texts and monologues. Classification is performed in either case by SVMs. A correlation-based feature selection was performed, in order to identify the most important features for each of these systems. We report recognition results of 91% when trying to differentiate between normal speaking persons and speakers with PD in early stages with prosodic modeling. With acoustic modeling we achieved a recognition rate of 88% and with vocal modeling we achieved 79%. After feature selection these results could greatly be improved. But we expect those results to be too optimistic. We show that read texts and monologues are the most meaningful texts when it comes to the automatic detection of PD based on articulation, voice, and prosodic evaluations. The most important prosodic features were based on energy, pauses and F0. The masses and the compliances of spring were found to be the most important parameters of the two-mass vocal fold model.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129309087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Subspace Gaussian Mixture Models for vectorial HMM-states representation","authors":"M. Bouallegue, D. Matrouf, Mickael Rouvier, G. Linarès","doi":"10.1109/ASRU.2011.6163984","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163984","url":null,"abstract":"In this paper we present a vectorial representation of the HMM states that is inspired by the Subspace Gaussian Mixture Models paradigm (SGMM). This vectorial representation of states will make possible a large number of applications, such as HMM-states clustering and graphical visualization. Thanks to this representation, the Hidden Markov Model (HMM) states can be seen as sets of points in multi-dimensional space and then can be studied using statistical data analysis techniques. In this paper, we show how this representation can be obtained and used for tying states of an HHM-based automatic speech recognition system without any use of linguistic or phonetic knowledge. In experiments, this approach achieves significant and stable gain, while conserving the classical approach based on decision trees. We also show how it can be used for graphical visualization, which can be useful in other domains like phonetics or clinical phonetics.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"4498 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127720933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Socio-situational setting classification based on language use","authors":"Yangyang Shi, P. Wiggers, C. Jonker","doi":"10.1109/ASRU.2011.6163974","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163974","url":null,"abstract":"We present a method for automatic classification of the socio-situational setting of a conversation based on the language used. The socio-situational setting depicts the social background of a conversation which involves the communicative goals, number of speakers, number of listeners and the relationship among the speakers and the listeners. Knowledge of the socio-situational setting can be used to search for content recorded in a particular setting or to select context-dependent models for example for speech recognition. We investigated the performance of different feature sets of conversation level features and word level features and their combinations on this task. Our final system, that classifies the conversations in the Spoken Dutch Corpus in one of 14 socio-situational settings, achieves an accuracy of 89.55%.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130960302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Making Deep Belief Networks effective for large vocabulary continuous speech recognition","authors":"Tara N. Sainath, Brian Kingsbury, B. Ramabhadran, P. Fousek, Petr Novák, Abdel-rahman Mohamed","doi":"10.1109/ASRU.2011.6163900","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163900","url":null,"abstract":"To date, there has been limited work in applying Deep Belief Networks (DBNs) for acoustic modeling in LVCSR tasks, with past work using standard speech features. However, a typical LVCSR system makes use of both feature and model-space speaker adaptation and discriminative training. This paper explores the performance of DBNs in a state-of-the-art LVCSR system, showing improvements over Multi-Layer Perceptrons (MLPs) and GMM/HMMs across a variety of features on an English Broadcast News task. In addition, we provide a recipe for data parallelization of DBN training, showing that data parallelization can provide linear speed-up in the number of machines, without impacting WER.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116734991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accent level adjustment in bilingual Thai-English text-to-speech synthesis","authors":"C. Wutiwiwatchai, A. Thangthai, A. Chotimongkol, C. Hansakunbuntheung, N. Thatphithakkul","doi":"10.1109/ASRU.2011.6163947","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163947","url":null,"abstract":"This paper introduces an accent level adjustment mechanism for Thai-English text-to-speech synthesis (TTS). English words often appearing in modern Thai writing can be speech synthesized by either Thai TTS using corresponding Thai phones or by separated English TTS using English phones. As many Thai native listeners may not prefer any of such extreme accent styles, a mechanism that allows selecting accent level preference is proposed. In HMM-based TTS, adjusting the accent level is done by interpolating HMMs of purely Thai and purely English sounds. Solutions for cross-language phone alignment and HMM state mapping are addressed. Evaluations are performed by a listening test on sounds synthesized with varied accent levels. Experimental results show that the proposed method is acceptable by the majority of human listeners.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114714211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}