2013 IEEE Workshop on Automatic Speech Recognition and Understanding: Latest Publications

Convolutional neural network based triangular CRF for joint intent detection and slot filling
2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Pub Date: 2013-12-01. DOI: 10.1109/ASRU.2013.6707709
Puyang Xu, R. Sarikaya
Abstract: We describe a joint model for intent detection and slot filling based on convolutional neural networks (CNN). The proposed architecture can be perceived as a neural network (NN) version of the triangular CRF model (TriCRF), in which the intent label and the slot sequence are modeled jointly and their dependencies are exploited. Our slot filling component is a globally normalized CRF-style model, as opposed to the left-to-right models in recent NN-based slot taggers. Its features are automatically extracted through CNN layers and shared by the intent model. We show that our slot model component generates state-of-the-art results, significantly outperforming the CRF. Our joint model outperforms the standard TriCRF by 1% absolute for both intent and slot. On a number of other domains, our joint model achieves 0.7-1% and 0.9-2.1% absolute gains over the independent modeling approach for intent and slot respectively.
Citations: 316
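To make the joint architecture concrete, here is a minimal PyTorch sketch of the idea: one shared CNN layer produces per-token features, an intent head pools them globally, and a CRF-style slot head combines per-token emission scores with a transition matrix under global normalization. All layer sizes, the single convolution, and the mean-pooled intent head are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class ConvTriCRF(nn.Module):
    """Sketch of a CNN-based joint intent/slot model; sizes are illustrative."""
    def __init__(self, vocab=1000, emb=50, feat=64, n_intents=5, n_slots=10, win=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        # One CNN layer extracts per-token features shared by both tasks.
        self.conv = nn.Conv1d(emb, feat, kernel_size=win, padding=win // 2)
        self.intent_out = nn.Linear(feat, n_intents)  # intent from pooled features
        self.slot_emit = nn.Linear(feat, n_slots)     # CRF emission scores
        self.trans = nn.Parameter(torch.zeros(n_slots, n_slots))  # CRF transitions

    def forward(self, tokens):                        # tokens: (T,) LongTensor
        h = self.embed(tokens).t().unsqueeze(0)       # (1, emb, T)
        h = torch.relu(self.conv(h)).squeeze(0).t()   # (T, feat) shared features
        intent_logits = self.intent_out(h.mean(dim=0))
        emissions = self.slot_emit(h)                 # (T, n_slots)
        return intent_logits, emissions

    def log_partition(self, emissions):
        """Forward algorithm over all slot sequences (global normalization)."""
        alpha = emissions[0]
        for t in range(1, emissions.size(0)):
            alpha = emissions[t] + torch.logsumexp(alpha.unsqueeze(1) + self.trans, dim=0)
        return torch.logsumexp(alpha, dim=0)

model = ConvTriCRF()
intent_logits, emissions = model(torch.randint(0, 1000, (12,)))
logZ = model.log_partition(emissions)  # denominator of the CRF sequence likelihood
```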
Mixture of mixture n-gram language models
2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Pub Date: 2013-12-01. DOI: 10.1109/ASRU.2013.6707701
H. Sak, Cyril Allauzen, Kaisuke Nakajima, F. Beaufays
Abstract: This paper presents a language model adaptation technique to build a single static language model from a set of language models, each trained on a separate text corpus, while aiming to maximize the likelihood of an adaptation data set given as a development set of sentences. The proposed model can be considered a mixture of mixture language models. The mixture model at the top level is a sentence-level mixture model where each sentence is assumed to be drawn from one of a discrete set of topic or task clusters. After selecting a cluster, each n-gram is assumed to be drawn from one of the given n-gram language models. We estimate cluster mixture weights and n-gram language model mixture weights for each cluster using the expectation-maximization (EM) algorithm, seeking the parameter estimates that maximize the likelihood of the development sentences. This mixture of mixture models can be represented efficiently as a static n-gram language model using the previously proposed Bayesian language model interpolation technique. We show a significant improvement with this technique (in both perplexity and WER) compared to the standard one-level interpolation scheme.
Citations: 2
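The estimation procedure lends itself to a compact sketch. The fragment below runs EM for the two-level mixture: sentence-level cluster responsibilities in the E-step, then cluster priors and per-cluster n-gram LM interpolation weights in the M-step. Component LM probabilities are assumed precomputed per word; the function name, initialization, and toy data are illustrative, not the authors' code.

```python
import numpy as np

def em_mixture_of_mixtures(probs, n_clusters, iters=50, seed=0):
    """probs: list of (T_s, M) arrays; probs[s][t, m] = P_m(word_t | history)
    from component LM m for sentence s. Returns cluster priors pi and
    per-cluster LM interpolation weights W. Minimal sketch of the paper's EM."""
    rng = np.random.default_rng(seed)
    M = probs[0].shape[1]
    pi = np.full(n_clusters, 1.0 / n_clusters)        # cluster priors
    W = rng.dirichlet(np.ones(M), size=n_clusters)    # per-cluster LM weights
    for _ in range(iters):
        # E-step: responsibility of each cluster for each dev sentence.
        log_lik = np.array([[np.log(p @ W[c]).sum() for c in range(n_clusters)]
                            for p in probs]) + np.log(pi)
        log_lik -= log_lik.max(axis=1, keepdims=True)
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)       # (S, n_clusters)
        # M-step: re-estimate cluster priors and LM weights per cluster.
        pi = resp.mean(axis=0)
        for c in range(n_clusters):
            counts = np.zeros(M)
            for s, p in enumerate(probs):
                gamma = p * W[c]                           # (T, M) joint
                gamma /= gamma.sum(axis=1, keepdims=True)  # component posteriors
                counts += resp[s, c] * gamma.sum(axis=0)
            W[c] = counts / counts.sum()
    return pi, W

# Toy usage: 5 dev sentences of 8 words, 3 component LMs, 2 clusters.
probs = [np.abs(np.random.default_rng(s).normal(size=(8, 3))) + 1e-3 for s in range(5)]
pi, W = em_mixture_of_mixtures(probs, n_clusters=2)
```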
Towards unsupervised semantic retrieval of spoken content with query expansion based on automatically discovered acoustic patterns
2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Pub Date: 2013-12-01. DOI: 10.1109/ASRU.2013.6707729
Yun-Chiao Li, Hung-yi Lee, Cheng-Tao Chung, Chun-an Chan, Lin-Shan Lee
Abstract: This paper presents an initial effort to retrieve semantically related spoken content in a completely unsupervised way. Unsupervised approaches to spoken content retrieval are attractive because they bypass the need for annotated data reasonably matched to the spoken content for training acoustic and language models. However, almost all such unsupervised approaches focus on spoken term detection, i.e., returning the spoken segments containing the query, using either template-matching techniques such as dynamic time warping (DTW) or model-based approaches. Users, however, usually prefer to retrieve all objects semantically related to the query, not necessarily those including the query terms. This paper proposes a different approach. We transcribe the spoken segments in the archive to be retrieved into sequences of acoustic patterns discovered automatically in an unsupervised manner. For an input query in spoken form, the top-N spoken segments obtained from the archive with a first-pass DTW retrieval are taken as pseudo-relevant. The acoustic patterns frequently occurring in these segments are then considered query-related and used for query expansion. Preliminary experiments on Mandarin broadcast news gave very encouraging results.
Citations: 10
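A minimal sketch of the retrieval-plus-expansion loop, under the assumption that each archive segment is available both as a feature matrix (for DTW) and as a sequence of discovered pattern IDs: first-pass DTW ranks the archive, the top-N hits are treated as pseudo-relevant, and their most frequent acoustic patterns become expansion terms. Function names and parameters are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def dtw_distance(q, s):
    """Plain DTW between two feature sequences (frames x dims); illustrative only."""
    T, S = len(q), len(s)
    D = np.full((T + 1, S + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, S + 1):
            cost = np.linalg.norm(q[i - 1] - s[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T, S]

def expand_query(query_feats, segments, pattern_seqs, top_n=10, top_k=5):
    """First-pass DTW retrieval; the top-N hits are pseudo-relevant, and their
    most frequent acoustic-pattern IDs are returned as expansion terms."""
    scores = [dtw_distance(query_feats, seg) for seg in segments]
    ranked = np.argsort(scores)              # smaller DTW distance = better match
    counts = Counter()
    for idx in ranked[:top_n]:
        counts.update(pattern_seqs[idx])     # pattern IDs of pseudo-relevant hits
    return [pat for pat, _ in counts.most_common(top_k)]
```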
Automatic sentiment extraction from YouTube videos
2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Pub Date: 2013-12-01. DOI: 10.1109/ASRU.2013.6707736
L. Kaushik, A. Sangwan, J. Hansen
Abstract: Extracting speaker sentiment from natural audio streams such as YouTube is challenging. A number of factors contribute to the task difficulty, namely Automatic Speech Recognition (ASR) of spontaneous speech, unknown background environments, variable source and channel characteristics, accents, diverse topics, etc. In this study, we build upon our previous work [5], where we had proposed a system for detecting sentiment in YouTube videos. In particular, we propose several enhancements, including (i) a better text-based sentiment model due to training on a larger and more diverse dataset, (ii) an iterative scheme to reduce sentiment model complexity with minimal impact on performance accuracy, (iii) better speech recognition due to superior acoustic modeling and focused (domain-dependent) vocabulary/language models, and (iv) a larger evaluation dataset. Collectively, our enhancements provide an absolute 10% improvement over our previous system in terms of sentiment detection accuracy. Additionally, we present analysis that helps in understanding the impact of WER (word error rate) on sentiment detection accuracy. Finally, we investigate the relative importance of different Parts-of-Speech (POS) tag features towards sentiment detection. Our analysis reveals the practicality of this technology and also provides several potential directions for future work.
Citations: 28
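The text-based sentiment component can be approximated with a standard bag-of-words classifier over ASR output, as in this toy scikit-learn sketch. The data, features, and model choice are stand-ins; the paper's actual sentiment model and training corpus are larger and more elaborate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy ASR-style transcripts with sentiment labels (1 = positive, 0 = negative).
transcripts = ["this phone is great i love the camera",
               "terrible battery life would not recommend",
               "really enjoyed the video quality",
               "awful screen and it keeps crashing"]
labels = [1, 0, 1, 0]

# Text-based sentiment model over (noisy) ASR output; in practice the
# transcripts come from a recognizer, so WER directly degrades this classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(transcripts, labels)
print(model.predict(["love this great camera"]))  # expected: positive (1)
```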
Dysfluent speech detection by image forensics techniques
2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Pub Date: 2013-12-01. DOI: 10.1109/ASRU.2013.6707712
Juraj Pálfy, Sakhia Darjaa, Jiri Pospíchal
Abstract: As speech recognition has become popular, the importance of dysfluency detection has increased considerably. Once a dysfluent event in spontaneous speech is identified, speech recognition performance can be enhanced by eliminating its negative effect. Most existing techniques for detecting such dysfluent events are based on statistical models. The sparse regularity of dysfluent events and the complexity of describing such events in a speech recognition system make their recognition difficult. We address these problems with an algorithm inspired by image forensics. This paper presents the algorithm, developed to extract novel features of complex dysfluencies. The common steps of classifier design were used to statistically evaluate the proposed features of complex dysfluencies in the spectral and cepstral domains. Support vector machines perform objective assessment of MFCC features, MFCC-based derived features, PCA-based derived features, and kernel-PCA-based derived features of complex dysfluencies; our derived features increased performance by 46% relative to MFCC.
Citations: 1
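The evaluation setup, kernel-PCA-derived features scored by an SVM, maps directly onto standard scikit-learn components. The sketch below uses random placeholder data in place of real MFCC statistics; dimensions and hyperparameters are assumptions, not the paper's values.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: rows are per-segment MFCC statistics, labels mark dysfluency.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))    # e.g. 13 mean MFCCs per segment (placeholder)
y = rng.integers(0, 2, size=200)  # 1 = dysfluent, 0 = fluent (placeholder)

# Kernel-PCA-derived features feeding an SVM, echoing the evaluation setup.
clf = make_pipeline(StandardScaler(),
                    KernelPCA(n_components=8, kernel="rbf"),
                    SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy on the placeholder data
```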
Hybrid speech recognition with Deep Bidirectional LSTM
2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Pub Date: 2013-12-01. DOI: 10.1109/ASRU.2013.6707742
Alex Graves, N. Jaitly, Abdel-rahman Mohamed
Abstract: Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have recently been shown to give state-of-the-art performance on the TIMIT speech database. However, the results in that work relied on recurrent-neural-network-specific objective functions, which are difficult to integrate with existing large vocabulary speech recognition systems. This paper investigates the use of DBLSTM as an acoustic model in a standard neural network-HMM hybrid system. We find that a DBLSTM-HMM hybrid gives equally good results on TIMIT as the previous work. It also outperforms both GMM and deep network benchmarks on a subset of the Wall Street Journal corpus. However, the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy. We conclude that the hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates. Further investigation needs to be conducted to understand how to better leverage the improvements in frame-level accuracy towards better word error rates.
Citations: 1517
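The acoustic-model side of such a hybrid is straightforward to sketch in PyTorch: a deep bidirectional LSTM emits per-frame HMM-state posteriors, which the hybrid converts to scaled likelihoods by dividing by state priors. Layer sizes, state count, and the uniform prior below are illustrative assumptions, not the paper's configuration.

```python
import math
import torch
import torch.nn as nn

class DBLSTMAcousticModel(nn.Module):
    """Sketch of a deep bidirectional LSTM acoustic model for an NN-HMM hybrid."""
    def __init__(self, n_feats=40, hidden=128, layers=3, n_states=2000):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_states)   # per-frame HMM-state logits

    def forward(self, x):                 # x: (batch, frames, n_feats)
        h, _ = self.lstm(x)
        return self.out(h).log_softmax(dim=-1)       # log state posteriors

model = DBLSTMAcousticModel()
feats = torch.randn(1, 100, 40)           # 100 frames of 40-dim features
log_post = model(feats)                   # (1, 100, 2000)
# In the hybrid, posteriors are divided by state priors to obtain scaled
# likelihoods for HMM decoding; a uniform prior is assumed here.
log_prior = torch.full((2000,), -math.log(2000.0))
scaled_loglik = log_post - log_prior
```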
Barge-in effects in Bayesian dialogue act recognition and simulation
2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Pub Date: 2013-12-01. DOI: 10.1109/ASRU.2013.6707713
H. Cuayáhuitl, Nina Dethlefs, H. Hastie, Oliver Lemon
Abstract: Dialogue act recognition and simulation are traditionally considered separate processes. Here, we argue that both can be fruitfully treated as interleaved processes within the same probabilistic model, leading to a synchronous improvement of performance in both. To demonstrate this, we train multiple Bayes Nets that predict the timing and content of the next user utterance. A specific focus is on providing support for barge-ins. We describe experiments using the Let's Go data that show an improvement in classification accuracy (+5%) in Bayesian dialogue act recognition involving barge-ins using partial context compared to using full context. Our results also indicate that simulated dialogues with user barge-in are more realistic than simulations without barge-in events.
Citations: 6
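As a stand-in for the paper's Bayes Nets, the toy sketch below estimates a maximum-likelihood conditional probability table P(user act | system act, barge-in) from dialogue tuples; modeling "partial context" would amount to dropping or marginalizing out parent variables. All data and names here are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy dialogue tuples: (system act, user barged in?, next user act).
data = [("ask_destination", False, "inform_destination"),
        ("ask_destination", True,  "inform_destination"),
        ("confirm",          True,  "negate"),
        ("confirm",          False, "affirm"),
        ("confirm",          True,  "negate")]

# Maximum-likelihood CPT P(user_act | system_act, barge_in).
cpt = defaultdict(Counter)
for sys_act, barge, user_act in data:
    cpt[(sys_act, barge)][user_act] += 1

def predict_user_act(sys_act, barge):
    """Distribution over the next user act given full context."""
    counts = cpt[(sys_act, barge)]
    total = sum(counts.values())
    return {act: n / total for act, n in counts.items()}

print(predict_user_act("confirm", True))  # {'negate': 1.0} on the toy data
```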
Vector Taylor series based HMM adaptation for generalized cepstrum in noisy environment
2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Pub Date: 2013-12-01. DOI: 10.1109/ASRU.2013.6707727
Soonho Baek, Hong-Goo Kang
Abstract: This paper proposes a novel HMM adaptation algorithm for robust automatic speech recognition (ASR) in noisy environments. HMM adaptation using vector Taylor series (VTS) significantly improves ASR performance in noisy environments. Recently, the power-normalized cepstral coefficient (PNCC), which replaces the logarithmic mapping function with a power mapping function, has been proposed, and the replacement of the mapping function has been shown to be robust to additive noise. In this paper, we extend the VTS-based approach to cepstral coefficients obtained using a power mapping function instead of a logarithmic one. Experimental results indicate that HMM adaptation in the cepstrum obtained with a power mapping function improves ASR performance compared to the conventional VTS-based approach for mel-frequency cepstral coefficients (MFCCs).
Citations: 1
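The core change, replacing logarithmic compression of the mel filterbank energies with a power mapping before the DCT, fits in a few lines. The sketch below computes such a generalized cepstrum next to the conventional log cepstrum; the exponent 1/15 follows the common PNCC choice, and the paper's full VTS adaptation is not reproduced here.

```python
import numpy as np
from scipy.fft import dct

def generalized_cepstrum(mel_energies, power=1.0 / 15.0):
    """Cepstrum from mel filterbank energies using power-law compression
    (x ** power, as in PNCC) instead of the usual log; sketch only."""
    return dct(mel_energies ** power, type=2, norm="ortho")

mel = np.abs(np.random.default_rng(0).normal(size=40)) + 1e-3  # fake energies
c_power = generalized_cepstrum(mel)               # power-mapped cepstrum
c_log = dct(np.log(mel), type=2, norm="ortho")    # conventional log cepstrum
```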
Improving robustness of deep neural networks via spectral masking for automatic speech recognition
2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Pub Date: 2013-12-01. DOI: 10.1109/ASRU.2013.6707743
Bo Li, K. Sim
Abstract: The performance of human listeners degrades rather slowly compared to machines in noisy environments. This has been attributed to the ability to perform auditory scene analysis, which separates the speech prior to recognition. In this work, we investigate two mask estimation approaches, namely state-dependent and deep neural network (DNN) based estimation, to separate speech from noise and improve the noise robustness of DNN acoustic models. The second approach has been experimentally shown to outperform the first. Due to stereo-data-based training and ill-defined masks for speech with channel distortions, neither method generalizes well to unseen conditions, and both fail to beat the performance of the multi-style trained baseline system. However, the model trained on masked features demonstrates strong complementarity to the baseline model. A simple average of the two systems' posteriors yields word error rates of 4.4% on Aurora2 and 12.3% on Aurora4.
Citations: 40
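The system combination at the end is plain frame-level posterior averaging, sketched below with toy posteriors. The array shapes and the equal 0.5 weights are assumptions.

```python
import numpy as np

def average_posteriors(log_post_a, log_post_b):
    """Frame-level fusion of two acoustic models by simple posterior averaging,
    as used to combine the baseline and masked-feature DNNs. Inputs are
    (frames, states) log-posterior arrays; sketch only."""
    post = 0.5 * (np.exp(log_post_a) + np.exp(log_post_b))
    return np.log(post)

# Toy example: 3 frames, 4 states per system.
rng = np.random.default_rng(1)
a = np.log(rng.dirichlet(np.ones(4), size=3))
b = np.log(rng.dirichlet(np.ones(4), size=3))
combined = average_posteriors(a, b)   # fed to the decoder in place of either system
```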
Impact of deep MLP architecture on different acoustic modeling techniques for under-resourced speech recognition
2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Pub Date: 2013-12-01. DOI: 10.1109/ASRU.2013.6707752
David Imseng, P. Motlícek, Philip N. Garner, H. Bourlard
Abstract: Posterior based acoustic modeling techniques such as Kullback-Leibler divergence based HMM (KL-HMM) and Tandem are able to exploit out-of-language data through posterior features, estimated by a Multi-Layer Perceptron (MLP). In this paper, we investigate the performance of posterior based approaches in the context of under-resourced speech recognition when a standard three-layer MLP is replaced by a deeper five-layer MLP. The deeper MLP architecture yields similar gains of about 15% (relative) for Tandem and KL-HMM, as well as for a hybrid HMM/MLP system that directly uses the posterior estimates as emission probabilities. The best performing system, a bilingual KL-HMM based on a deep MLP, jointly trained on Afrikaans and Dutch data, performs 13% better than a hybrid system using the same bilingual MLP and 26% better than a subspace Gaussian mixture system only trained on Afrikaans data.
Citations: 28
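In a KL-HMM, each state holds a multinomial distribution over the MLP's posterior classes, and the local score for a frame is the KL divergence between that multinomial and the frame's posterior vector. A minimal sketch of that local score, with toy numbers, follows; it is an illustration of the KL-HMM idea, not the cited systems' code.

```python
import numpy as np

def kl_local_score(state_dist, frame_post, eps=1e-10):
    """KL-HMM local score: KL divergence between a state's multinomial over
    posterior classes and the MLP posterior for one frame (lower = better)."""
    y = state_dist + eps
    z = frame_post + eps
    return float(np.sum(y * np.log(y / z)))

# Toy: 3 phone classes; a state expecting mostly class 0 vs two frame posteriors.
state = np.array([0.8, 0.1, 0.1])
good = np.array([0.7, 0.2, 0.1])
bad = np.array([0.1, 0.2, 0.7])
print(kl_local_score(state, good) < kl_local_score(state, bad))  # True
```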