2009 IEEE Workshop on Automatic Speech Recognition & Understanding: Latest Publications

Using temporal information for improving articulatory-acoustic feature classification
2009 IEEE Workshop on Automatic Speech Recognition & Understanding. Pub Date: 2009-12-01. DOI: 10.1109/ASRU.2009.5373314
Barbara Schuppler, Joost van Doremalen, O. Scharenborg, B. Cranen, L. Boves
{"title":"Using temporal information for improving articulatory-acoustic feature classification","authors":"Barbara Schuppler, Joost van Doremalen, O. Scharenborg, B. Cranen, L. Boves","doi":"10.1109/ASRU.2009.5373314","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373314","url":null,"abstract":"This paper combines acoustic features with a high temporal and a high frequency resolution to reliably classify articulatory events of short duration, such as bursts in plosives. SVM classification experiments on TIMIT and SVArticulatory showed that articulatory-acoustic features (AFs) based on a combination of MFCCs derived from a long window of 25ms and a short window of 5ms that are both shifted with 2.5ms steps (Both) outperform standard MFCCs derived with a window of 25 ms and a shift of 10 ms (Baseline). Finally, comparison of the TIMIT and SVArticulatory results showed that for classifiers trained on data that allows for asynchronously changing AFs (SVArticulatory) the improvement from Baseline to Both is larger than for classifiers trained on data where AFs change simultaneously with the phone boundaries (TIMIT).","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131033030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Investigations on features for log-linear acoustic models in continuous speech recognition
2009 IEEE Workshop on Automatic Speech Recognition & Understanding. Pub Date: 2009-12-01. DOI: 10.1109/ASRU.2009.5373362
Simon Wiesler, M. Nußbaum-Thom, G. Heigold, R. Schlüter, H. Ney
{"title":"Investigations on features for log-linear acoustic models in continuous speech recognition","authors":"Simon Wiesler, M. Nußbaum-Thom, G. Heigold, R. Schlüter, H. Ney","doi":"10.1109/ASRU.2009.5373362","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373362","url":null,"abstract":"Hidden Markov Models with Gaussian Mixture Models as emission probabilities (GHMMs) are the underlying structure of all state-of-the-art speech recognition systems. Using Gaussian mixture distributions follows the generative approach where the class-conditional probability is modeled, although for classification only the posterior probability is needed. Though being very successful in related tasks like Natural Language Processing (NLP), in speech recognition direct modeling of posterior probabilities with log-linear models has rarely been used and has not been applied successfully to continuous speech recognition. In this paper we report competitive results for a speech recognizer with a log-linear acoustic model on the Wall Street Journal corpus, a Large Vocabulary Continuous Speech Recognition (LVCSR) task. We trained this model from scratch, i.e. without relying on an existing GHMM system. Previously the use of data dependent sparse features for log-linear models has been proposed. We compare them with polynomial features and show that the combination of polynomial and data dependent sparse features leads to better results.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125881899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
Multi-view learning of acoustic features for speaker recognition
2009 IEEE Workshop on Automatic Speech Recognition & Understanding. Pub Date: 2009-12-01. DOI: 10.1109/ASRU.2009.5373462
Karen Livescu, Mark Stoehr
{"title":"Multi-view learning of acoustic features for speaker recognition","authors":"Karen Livescu, Mark Stoehr","doi":"10.1109/ASRU.2009.5373462","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373462","url":null,"abstract":"We consider learning acoustic feature transformations using an additional view of the data, in this case video of the speaker's face. Specifically, we consider a scenario in which clean audio and video is available at training time, while at test time only noisy audio is available. We use canonical correlation analysis (CCA) to learn linear projections of the acoustic observations that have maximum correlation with the video frames. We provide an initial demonstration of the approach on a speaker recognition task using data from the VidTIMIT corpus. The projected features, in combination with baseline MFCCs, outperform the baseline recognizer in noisy conditions. The techniques we present are quite general, although here we apply them to the case of a specific speaker recognition task. This is the first work of which we are aware in which multiple views are used to learn an acoustic feature projection at training time, while using only the acoustics at test time.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124613608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
Kernel metric learning for phonetic classification
2009 IEEE Workshop on Automatic Speech Recognition & Understanding. Pub Date: 2009-12-01. DOI: 10.1109/ASRU.2009.5373389
J. Huang, Xi Zhou, M. Hasegawa-Johnson, Thomas S. Huang
{"title":"Kernel metric learning for phonetic classification","authors":"J. Huang, Xi Zhou, M. Hasegawa-Johnson, Thomas S. Huang","doi":"10.1109/ASRU.2009.5373389","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373389","url":null,"abstract":"While a sound spoken is described by a handful of frame-level spectral vectors, not all frames have equal contribution for either human perception or machine classification. In this paper, we introduce a novel framework to automatically emphasize important speech frames relevant to phonetic information. We jointly learn the importance of speech frames by a distance metric across the phone classes, attempting to satisfy a large margin constraint: the distance from a segment to its correct label class should be less than the distance to any other phone class by the largest possible margin. Furthermore, an universal background model structure is proposed to give the correspondence between statistical models of phone types and tokens, allowing us to use statistical models of each phone token in a large margin speech recognition framework. Experiments on TIMIT database demonstrated the effectiveness of our framework.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126272397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Transition features for CRF-based speech recognition and boundary detection
2009 IEEE Workshop on Automatic Speech Recognition & Understanding. Pub Date: 2009-12-01. DOI: 10.1109/ASRU.2009.5373287
Spiros Dimopoulos, E. Fosler-Lussier, Chin-Hui Lee, A. Potamianos
{"title":"Transition features for CRF-based speech recognition and boundary detection","authors":"Spiros Dimopoulos, E. Fosler-Lussier, Chin-Hui Lee, A. Potamianos","doi":"10.1109/ASRU.2009.5373287","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373287","url":null,"abstract":"In this paper, we investigate a variety of spectral and time domain features for explicitly modeling phonetic transitions in speech recognition. Specifically, spectral and energy distance metrics, as well as, time derivatives of phonological descriptors and MFCCs are employed. The features are integrated in an extended Conditional Random Fields statistical modeling framework that supports general-purpose transition models. For evaluation purposes, we measure both phonetic recognition task accuracy and precision/recall of boundary detection. Results show that when transition features are used in a CRF-based recognition framework, recognition performance improves significantly due to the reduction of phone deletions. The boundary detection performance also improves mainly for transitions among silence, stop, and fricative phonetic classes.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123563996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Robust vocabulary independent keyword spotting with graphical models
2009 IEEE Workshop on Automatic Speech Recognition & Understanding. Pub Date: 2009-12-01. DOI: 10.1109/ASRU.2009.5373544
M. Wöllmer, F. Eyben, Björn Schuller, G. Rigoll
{"title":"Robust vocabulary independent keyword spotting with graphical models","authors":"M. Wöllmer, F. Eyben, Björn Schuller, G. Rigoll","doi":"10.1109/ASRU.2009.5373544","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373544","url":null,"abstract":"This paper introduces a novel graphical model architecture for robust and vocabulary independent keyword spotting which does not require the training of an explicit garbage model. We show how a graphical model structure for phoneme recognition can be extended to a keyword spotter that is robust with respect to phoneme recognition errors. We use a hidden garbage variable together with the concept of switching parents to model keywords as well as arbitrary speech. This implies that keywords can be added to the vocabulary without having to re-train the model. Thereby the design of our model architecture is optimised to reliably detect keywords rather than to decode keyword phoneme sequences as arbitrary speech, while offering a parameter to adjust the operating point on the receiver operating characteristics curve. Experiments on the TIMIT corpus reveal that our graphical model outperforms a comparable hidden Markov model based keyword spotter that uses conventional garbage modelling.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125197866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
Discriminative adaptive training with VTS and JUD
2009 IEEE Workshop on Automatic Speech Recognition & Understanding. Pub Date: 2009-12-01. DOI: 10.1109/ASRU.2009.5373266
F. Flego, M. Gales
{"title":"Discriminative adaptive training with VTS and JUD","authors":"F. Flego, M. Gales","doi":"10.1109/ASRU.2009.5373266","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373266","url":null,"abstract":"Adaptive training is a powerful approach for building speech recognition systems on non-homogeneous training data. Recently approaches based on predictive model-based compensation schemes, such as Joint Uncertainty Decoding (JUD) and Vector Taylor Series (VTS), have been proposed. This paper reviews these model-based compensation schemes and relates them to factor-analysis style systems. Forms of Maximum Likelihood (ML) adaptive training with these approaches are described, based on both second-order optimisation schemes and Expectation Maximisation (EM). However, discriminative training is used in many state-of-the-art speech recognition. Hence, this paper proposes discriminative adaptive training with predictive model-compensation approaches for noise robust speech recognition. This training approach is applied to both JUD and VTS compensation with minimum phone error training. A large scale multi-environment training configuration is used and the systems evaluated on a range of in-car collected data tasks.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117247743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
Garbage modeling with decoys for a sequential recognition scenario
2009 IEEE Workshop on Automatic Speech Recognition & Understanding. Pub Date: 2009-12-01. DOI: 10.1109/ASRU.2009.5372919
Michael Levit, Shuangyu Chang, B. Buntschuh
{"title":"Garbage modeling with decoys for a sequential recognition scenario","authors":"Michael Levit, Shuangyu Chang, B. Buntschuh","doi":"10.1109/ASRU.2009.5372919","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5372919","url":null,"abstract":"This paper is concerned with a speech recognition scenario where two unequal ASR systems, one fast with constrained resources, the other significantly slower but also much more powerful, work together in a sequential manner. In particular, we focus on decisions when to accept the results of the first recognizer and when the second recognizer needs to be consulted. As a kind of application-dependent garbage modeling, we suggest an algorithm that augments the grammar of the first recognizer with those valid paths through the language model of the second recognizer that are confusable with the phrases from this grammar. We show how this algorithm outperforms a system that only looks at recognition confidences by about 20% relative.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116844984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Automatic selection of recognition errors by respeaking the intended text
2009 IEEE Workshop on Automatic Speech Recognition & Understanding. Pub Date: 2009-12-01. DOI: 10.1109/ASRU.2009.5373347
K. Vertanen, P. Kristensson
{"title":"Automatic selection of recognition errors by respeaking the intended text","authors":"K. Vertanen, P. Kristensson","doi":"10.1109/ASRU.2009.5373347","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373347","url":null,"abstract":"We investigate how to automatically align spoken corrections with an initial speech recognition result. Such automatic alignment would enable one-step voice-only correction in which users simply respeak their intended text. We present three new models for automatically aligning corrections: a 1-best model, a word confusion network model, and a revision model. The revision model allows users to alter what they intended to write even when the initial recognition was completely correct. We evaluate our models with data gathered from two user studies. We show that providing just a single correct word of context dramatically improves alignment success from 65% to 84%. We find that a majority of users provide such context without being explicitly instructed to do so. We find that the revision model is superior when users modify words in their initial recognition, improving alignment success from 73% to 83%. We show how our models can easily incorporate prior information about correction location and we show that such information aids alignment success. Last, we observe that users speak their intended text faster and with fewer re-recordings than if they are forced to speak misrecognized text.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129630941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Robust distributed speech recognition using two-stage Filtered Minima Controlled Recursive Averaging
2009 IEEE Workshop on Automatic Speech Recognition & Understanding. Pub Date: 2009-12-01. DOI: 10.1109/ASRU.2009.5372925
Negar Ghourchian, S. Selouani, D. O'Shaughnessy
{"title":"Robust distributed speech recognition using two-stage Filtered Minima Controlled Recursive Averaging","authors":"Negar Ghourchian, S. Selouani, D. O'Shaughnessy","doi":"10.1109/ASRU.2009.5372925","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5372925","url":null,"abstract":"This paper examines the use of a new Filtered Minima-Controlled Recursive Averaging (FMCRA) noise estimation technique as a robust front-end processing to improve the performance of a Distributed Speech Recognition (DSR) system in noisy environments. The noisy speech is enhanced by using a two-stage framework in order to simultaneously address the inefficiency of the Voice Activity Detector (VAD) and to remedy the inadequacies of MCRA. The performance evaluation carried out on the Aurora 2 task showed that the inclusion of FMCRA in the front-end side leads to a significant improvement in DSR accuracy.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126363374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4