2012 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

On the generalization of Shannon entropy for speech recognition
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424204
Nicolas Obin, M. Liuni
{"title":"On the generalization of Shannon entropy for speech recognition","authors":"Nicolas Obin, M. Liuni","doi":"10.1109/SLT.2012.6424204","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424204","url":null,"abstract":"This paper introduces an entropy-based spectral representation as a measure of the degree of noisiness in audio signals, complementary to the standard MFCCs for audio and speech recognition. The proposed representation is based on the Rényi entropy, which is a generalization of the Shannon entropy. In audio signal representation, Rényi entropy presents the advantage of focusing either on the harmonic content (prominent amplitude within a distribution) or on the noise content (equal distribution of amplitudes). The proposed representation outperforms all other noisiness measures - including Shannon and Wiener entropies - in a large-scale classification of vocal effort (whispered-soft/normal/loud-shouted) in the real scenario of multi-language massive role-playing video games. The improvement is around 10% in relative error reduction, and is particularly significant for the recognition of noisy speech - i.e., whispery/breathy speech. This confirms the role of noisiness for speech recognition, and will further be extended to the classification of voice quality for the design of an automatic voice casting system in video games.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126208023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
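To make the generalization concrete: the Rényi entropy of order α of a distribution p is H_α(p) = (1/(1−α)) log Σ p_i^α, which recovers the Shannon entropy in the limit α → 1. Below is a minimal NumPy sketch, not the authors' feature pipeline; the toy spectra are invented for illustration and show how the measure separates a flat, noise-like frame from a peaky, harmonic one.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy (in bits) of order alpha; Shannon entropy as alpha -> 1."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                            # normalize to a distribution
    if np.isclose(alpha, 1.0):                 # Shannon limit
        nz = p[p > 0]
        return float(-np.sum(nz * np.log2(nz)))
    return float(np.log2(np.sum(p ** alpha)) / (1.0 - alpha))

# Toy "spectral frames": flat (noise-like) vs. peaky (harmonic-like).
flat = np.ones(64)
peaky = np.full(64, 1e-3)
peaky[8], peaky[16] = 1.0, 0.3                 # two dominant partials

for alpha in (0.5, 1.0, 2.0):
    print(f"alpha={alpha}: flat={renyi_entropy(flat, alpha):.2f} bits, "
          f"peaky={renyi_entropy(peaky, alpha):.2f} bits")
```

Orders below 1 weight the noise floor more heavily, orders above 1 weight the dominant peaks, which is exactly the flexibility the abstract points to.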
Speech-based emotion classification using multiclass SVM with hybrid kernel and thresholding fusion
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424267
Na Yang, R. Muraleedharan, J. Kohl, I. Demirkol, W. Heinzelman, Melissa L. Sturge‐Apple
{"title":"Speech-based emotion classification using multiclass SVM with hybrid kernel and thresholding fusion","authors":"Na Yang, R. Muraleedharan, J. Kohl, I. Demirkol, W. Heinzelman, Melissa L. Sturge‐Apple","doi":"10.1109/SLT.2012.6424267","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424267","url":null,"abstract":"Emotion classification is essential for understanding human interactions and hence is a vital component of behavioral studies. Although numerous algorithms have been developed, the emotion classification accuracy is still short of what is desired for the algorithms to be used in real systems. In this paper, we evaluate an approach where basic acoustic features are extracted from speech samples, and the One-Against-All (OAA) Support Vector Machine (SVM) learning algorithm is used. We use a novel hybrid kernel, where we choose the optimal kernel functions for the individual OAA classifiers. Outputs from the OAA classifiers are normalized and combined using a thresholding fusion mechanism to finally classify the emotion. Samples with low `relative confidence' are left as `unclassified' to further improve the classification accuracy. Results show that the decision-level recall of our approach for six-class emotion classification is 80.5%, outperforming a state-of-the-art approach that uses the same dataset.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115376629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 54
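A rough sketch of the pipeline described above: per-class One-Against-All SVMs, each free to use its own kernel (the "hybrid kernel" idea), with normalized scores fused through a confidence threshold that leaves ambiguous samples unclassified. The emotion list, per-class kernel assignments, and margin value are illustrative placeholders, not the paper's tuned choices.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

EMOTIONS = ["anger", "happiness", "sadness", "fear", "disgust", "neutral"]
KERNELS = ["rbf", "linear", "rbf", "poly", "rbf", "linear"]  # hypothetical picks

def train_oaa(X, y):
    """Train one binary (one-vs-all) SVM per emotion, each with its own kernel."""
    scaler = StandardScaler().fit(X)
    Xs = scaler.transform(X)
    clfs = [SVC(kernel=KERNELS[k]).fit(Xs, (y == k).astype(int))
            for k in range(len(EMOTIONS))]
    return scaler, clfs

def classify(x, scaler, clfs, margin=0.15):
    """Thresholding fusion: normalize the OAA scores and refuse to decide
    when the top two classes are too close ('low relative confidence')."""
    xs = scaler.transform(x.reshape(1, -1))
    scores = np.array([c.decision_function(xs)[0] for c in clfs])
    scores = (scores - scores.min()) / (np.ptp(scores) + 1e-9)
    top2 = np.sort(scores)[-2:]
    if top2[1] - top2[0] < margin:
        return "unclassified"
    return EMOTIONS[int(np.argmax(scores))]
```

In practice each binary kernel would be chosen by cross-validation, and the margin tuned against the trade-off between recall and the unclassified rate.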
Policy optimisation of POMDP-based dialogue systems without state space compression
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424165
Milica Gasic, Matthew Henderson, Blaise Thomson, P. Tsiakoulis, S. Young
{"title":"Policy optimisation of POMDP-based dialogue systems without state space compression","authors":"Milica Gasic, Matthew Henderson, Blaise Thomson, P. Tsiakoulis, S. Young","doi":"10.1109/SLT.2012.6424165","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424165","url":null,"abstract":"The partially observable Markov decision process (POMDP) has been proposed as a dialogue model that enables automatic improvement of the dialogue policy and robustness to speech understanding errors. It requires, however, a large number of dialogues to train the dialogue policy. Gaussian processes (GP) have recently been applied to POMDP dialogue management optimisation showing an ability to substantially increase the speed of learning. Here, we investigate this further using the Bayesian Update of Dialogue State dialogue manager. We show that it is possible to apply Gaussian processes directly to the belief state, removing the need for a parametric policy representation. In addition, the resulting policy learns significantly faster while maintaining operational performance.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"212 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116297449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
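The central claim, that a kernel can be defined directly on belief-state vectors so no parametric policy or hand-built summary space is needed, can be illustrated with plain GP regression of the return on the belief. This is a heavy simplification of the episodic GP reinforcement learning used in this line of work, and the function names are invented:

```python
import numpy as np

def belief_kernel(b1, b2, scale=5.0):
    """Gaussian kernel applied directly to full POMDP belief vectors;
    no state space compression is required."""
    return np.exp(-scale * np.sum((b1 - b2) ** 2))

def gp_value(b, past_beliefs, past_returns, noise=0.1):
    """GP posterior mean of the return at belief b, from (belief, return)
    pairs collected in earlier dialogues. A policy sketch would evaluate
    this per action and act greedily."""
    K = np.array([[belief_kernel(x, y) for y in past_beliefs]
                  for x in past_beliefs])
    k = np.array([belief_kernel(b, x) for x in past_beliefs])
    alpha = np.linalg.solve(K + noise * np.eye(len(past_beliefs)), past_returns)
    return float(k @ alpha)
```

The kernel correlates similar belief states, which is what lets experience from one dialogue generalize to others and accounts for the faster learning reported.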
Personalized language modeling by crowd sourcing with social network data for voice access of cloud applications
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424220
Tsung-Hsien Wen, Hung-yi Lee, Tai-Yuan Chen, Lin-Shan Lee
{"title":"Personalized language modeling by crowd sourcing with social network data for voice access of cloud applications","authors":"Tsung-Hsien Wen, Hung-yi Lee, Tai-Yuan Chen, Lin-Shan Lee","doi":"10.1109/SLT.2012.6424220","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424220","url":null,"abstract":"Voice access of cloud applications via smartphones is very attractive today, specifically because a smartphones is used by a single user, so personalized acoustic/language models become feasible. In particular, huge quantities of texts are available within the social networks over the Internet with known authors and given relationships, it is possible to train personalized language models because it is reasonable to assume users with those relationships may share some common subject topics, wording habits and linguistic patterns. In this paper, we propose an adaptation framework for building a robust personalized language model by incorporating the texts the target user and other users had posted on the social networks over the Internet to take care of the linguistic mismatch across different users. Experiments on Facebook dataset showed encouraging improvements in terms of both model perplexity and recognition accuracy with proposed approaches considering relationships among users, similarity based on latent topics, and random walk over a user graph.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130066166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
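A minimal unigram sketch of the adaptation idea: mix the target user's own text, text from socially connected users, and a background model. The fixed weights stand in for the paper's relationship-, topic-similarity- and random-walk-based weighting, and all names and numbers are hypothetical.

```python
from collections import Counter

def interpolate_lm(user_counts, social_counts, background, lam=(0.5, 0.3, 0.2)):
    """Linear interpolation of three unigram sources; background is assumed
    to be a normalized word -> probability map over the vocabulary."""
    def mle(counts):
        total = sum(counts.values()) or 1
        return {w: c / total for w, c in counts.items()}
    pu, ps = mle(user_counts), mle(social_counts)
    vocab = set(pu) | set(ps) | set(background)
    l_user, l_social, l_bg = lam
    return {w: l_user * pu.get(w, 0.0) + l_social * ps.get(w, 0.0)
               + l_bg * background.get(w, 0.0) for w in vocab}

# Toy usage: the social component pulls in words the user has not typed yet.
user = Counter({"cloud": 4, "voice": 3, "app": 2})
friends = Counter({"cloud": 2, "smartphone": 5})
bg = {"cloud": 0.02, "voice": 0.01, "app": 0.01, "smartphone": 0.01, "the": 0.05}
lm = interpolate_lm(user, friends, bg)
```

In the paper the interpolation weights are not fixed but derived from the user graph; an n-gram version follows the same pattern per history.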
Lexical entrainment and success in student engineering groups
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424258
Heather Friedberg, D. Litman, Susannah B. F. Paletz
{"title":"Lexical entrainment and success in student engineering groups","authors":"Heather Friedberg, D. Litman, Susannah B. F. Paletz","doi":"10.1109/SLT.2012.6424258","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424258","url":null,"abstract":"Lexical entrainment is a measure of how the words that speakers use in a conversation become more similar over time. In this paper, we propose a measure of lexical entrainment for multi-party speaking situations. We apply this score to a corpus of student engineering groups using high-frequency words and project words, and investigate the relationship between lexical entrainment and group success on a class project. Our initial findings show that, using the entrainment score with project-related words, there is a significant difference between the lexical entrainment of high performing groups, which tended to increase with time, and the entrainment for low performing groups, which tended to decrease with time.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125009370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 57
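The abstract does not spell out the multi-party score, so the sketch below is one plausible formalization: profile each speaker's relative frequencies over a target word list (e.g., the project words) and score the group by the negated mean pairwise distance between profiles, so that higher means more similar usage. Computing this per time window and examining the slope would mirror the increase/decrease finding.

```python
import numpy as np
from collections import Counter

def group_entrainment(utterances_by_speaker, target_words):
    """utterances_by_speaker: list (one entry per speaker) of lists of
    utterance strings from one time window. Returns a score where
    higher = more entrained (more similar target-word usage)."""
    profiles = []
    for utts in utterances_by_speaker:
        counts = Counter(w for u in utts for w in u.lower().split())
        total = sum(counts[w] for w in target_words) or 1
        profiles.append(np.array([counts[w] / total for w in target_words]))
    dists = [np.abs(profiles[i] - profiles[j]).sum()
             for i in range(len(profiles))
             for j in range(i + 1, len(profiles))]
    return -float(np.mean(dists))
```

A trend test (e.g., the sign of a least-squares slope over successive windows) would then separate groups whose entrainment rises from those whose entrainment falls.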
Automatic transcription of academic lectures from diverse disciplines
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424257
Ghada Alharbi, Thomas Hain
{"title":"Automatic transcription of academic lectures from diverse disciplines","authors":"Ghada Alharbi, Thomas Hain","doi":"10.1109/SLT.2012.6424257","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424257","url":null,"abstract":"In a multimedia world it is now common to record professional presentations, on video or with audio only. Such recordings include talks and academic lectures, which are becoming a valuable resource for students and professionals alike. However, organising such material from a diverse set of disciplines seems to be not an easy task. One way to address this problem is to build an Automatic Speech Recognition (ASR) system in order to use its output for analysing such materials. In this work ASR results for lectures from diverse sources are presented. The work is based on a new collection of data, obtained by the Liberated Learning Consortium (LLC). The study's primary goals are two-fold: first to show variability across disciplines from an ASR perspective, and how to choose sources for the construction of language models (LMs); second, to provide an analysis of the lecture transcription for automatic determination of structures in lecture discourse. In particular, we investigate whether there are properties common to lectures from different disciplines. This study focuses on textual features. Lectures are multimodal experiences - it is not clear whether textual features alone are sufficient for the recognition of such common elements, or other features, e.g. acoustic features such as the speaking rate, are needed. The results show that such common properties are retained across disciplines even on ASR output with a Word Error Rate (WER) of 30%.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126434240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
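Since the headline result is that discourse structure survives ASR output at 30% WER, a quick reference implementation of the Word Error Rate metric may help: the standard Levenshtein distance over word tokens, normalized by the reference length. The example sentences are invented.

```python
def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / len(ref)."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the lecture covers graph theory", "a lecture covers graph theory"))  # 0.2
```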
Combining cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424211
Cong-Thanh Do, M. Taghizadeh, Philip N. Garner
{"title":"Combining cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition","authors":"Cong-Thanh Do, M. Taghizadeh, Philip N. Garner","doi":"10.1109/SLT.2012.6424211","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424211","url":null,"abstract":"This paper investigates the combination of cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition. Testing speech signals are recorded by a circular microphone array and are subsequently processed with superdirective beamforming and McCowan post-filtering. Training speech signals, from the multichannel overlapping Number corpus (MONC), are clean and not overlapping. Cochlear implant-like speech processing, which is inspired from the speech processing strategy in cochlear implants, is applied on the training and testing speech signals. Cepstral normalization, including cepstral mean and variance normalization (CMN and CVN), are applied on the training and testing cepstra. Experiments show that implementing either cepstral normalization or cochlear implant-like speech processing helps in reducing the WERs of microphone array-based speech recognition. Combining cepstral normalization and cochlear implant-like speech processing reduces further the WERs, when there is overlapping speech. Train/test mismatches are measured using the Kullback-Leibler divergence (KLD), between the global probability density functions (PDFs) of training and testing cepstral vectors. This measure reveals a train/test mismatch reduction when either cepstral normalization or cochlear implant-like speech processing is used. It reveals also that combining these two processing reduces further the train/test mismatches as well as the WERs.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116829370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
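Two ingredients above are easy to sketch: CMN/CVN over an utterance's cepstra, and a KLD-based train/test mismatch measure. The paper's exact KLD estimator over global PDFs may differ; here diagonal Gaussians are assumed and the divergence is symmetrized.

```python
import numpy as np

def cmvn(cepstra):
    """Cepstral mean and variance normalization (CMN + CVN): make each
    cepstral dimension zero-mean, unit-variance over the utterance.
    cepstra: array of shape (frames, dims)."""
    return (cepstra - cepstra.mean(axis=0)) / (cepstra.std(axis=0) + 1e-9)

def mismatch_kld(train_ceps, test_ceps):
    """Symmetrized KL divergence between diagonal Gaussians fitted to the
    training and test cepstral vectors; smaller = less train/test mismatch."""
    m1, v1 = train_ceps.mean(0), train_ceps.var(0) + 1e-9
    m2, v2 = test_ceps.mean(0), test_ceps.var(0) + 1e-9
    kl_12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl_21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m2 - m1) ** 2) / v1 - 1.0)
    return float(kl_12 + kl_21)
```

Applying cmvn to both sides and re-measuring mismatch_kld reproduces, in miniature, the paper's observation that normalization shrinks the divergence between training and test distributions.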
A noise-robust speech recognition method composed of weak noise suppression and weak Vector Taylor Series Adaptation
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424205
Shuji Komeiji, T. Arakawa, Takafumi Koshinaka
{"title":"A noise-robust speech recognition method composed of weak noise suppression and weak Vector Taylor Series Adaptation","authors":"Shuji Komeiji, T. Arakawa, Takafumi Koshinaka","doi":"10.1109/SLT.2012.6424205","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424205","url":null,"abstract":"This paper proposes a noise-robust speech recognition method composed of weak noise suppression (NS) and weak Vector Taylor Series Adaptation (VTSA). The proposed method compensates defects of NS and VTSA, and gains only the advantages by them. The weak NS reduces distortion by over-suppression that may accompany noise-suppressed speech. The weak VTSA avoids over-adaptation by offsetting a part of acoustic-model adaptation that corresponds to the suppressed noise. Evaluation results with the AURORA2 database show that the proposed method achieves as much as 1.2 points higher word accuracy (87.4%) than a method with VTSA alone (86.2%) that is always better than its counterpart with NS.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128884900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
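One way to read "weak" noise suppression is deliberate under-suppression: subtract only part of the noise estimate so that musical-noise distortion stays low, and leave the residual noise to the (correspondingly weakened) model adaptation. The sketch below is that reading expressed as plain spectral subtraction, not the paper's actual front end; all parameters are placeholders.

```python
import numpy as np

def weak_spectral_subtraction(power_spec, noise_est, strength=0.5, floor=0.05):
    """Spectral subtraction with a small subtraction factor (strength < 1
    = 'weak' NS). power_spec and noise_est are per-bin power spectra.
    The spectral floor prevents negative or near-zero bins."""
    residual = power_spec - strength * noise_est
    return np.maximum(residual, floor * power_spec)
```

The complementary "weak VTSA" step would then adapt the acoustic model to the residual noise only, i.e., to roughly (1 - strength) times the noise estimate, so that suppression and adaptation are not applied to the same noise twice.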
Automatic detection and correction of syntax-based prosody annotation errors
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424259
Sandrine Brognaux, Thomas Drugman, Richard Beaufort
{"title":"Automatic detection and correction of syntax-based prosody annotation errors","authors":"Sandrine Brognaux, Thomas Drugman, Richard Beaufort","doi":"10.1109/SLT.2012.6424259","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424259","url":null,"abstract":"Both unit-selection and HMM-based speech synthesis require large annotated speech corpora. To generate more natural speech, considering the prosodic nature of each phoneme of the corpus is crucial. Generally, phonemes are assigned labels which should reflect their suprasegmental characteristics. Labels often result from an automatic syntactic analysis, without checking the acoustic realization of the phoneme in the corpus. This leads to numerous errors because syntax and prosody do not always coincide. This paper proposes a method to reduce the amount of labeling errors, using acoustic information. It is applicable as a post-process to any syntax-driven prosody labeling. Acoustic features are considered, to check the syntax-based labels and suggest potential modifications. The proposed technique has the advantage of not requiring a manually prosody-labelled corpus. The evaluation on a corpus in French shows that more than 75% of the errors detected by the method are effective errors which must be corrected.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114330805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
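A toy version of the verification step: given syntax-derived prominence labels together with per-syllable durations and mean F0, flag syllables where the acoustics contradict the label. The z-score rule, threshold, and feature set are invented stand-ins for the paper's acoustic checking.

```python
import numpy as np

def flag_label_errors(labels, durations, f0_means, z_thresh=1.0):
    """labels: list of 'prominent'/'non-prominent' per syllable;
    durations, f0_means: NumPy arrays aligned with labels.
    Returns indices where the syntax-based label disagrees with a
    simple acoustic prominence test (lengthening + raised F0)."""
    dz = (durations - durations.mean()) / (durations.std() + 1e-9)
    fz = (f0_means - f0_means.mean()) / (f0_means.std() + 1e-9)
    flagged = []
    for i, lab in enumerate(labels):
        acoustically_prominent = (dz[i] + fz[i]) > z_thresh
        if (lab == "prominent") != acoustically_prominent:
            flagged.append(i)   # syntax and acoustics disagree: candidate error
    return flagged
```

Flagged positions would then be corrected automatically or handed to a lightweight manual check, consistent with the method needing no manually prosody-labelled corpus.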
Context-dependent Deep Neural Networks for audio indexing of real-life data
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424212
Gang Li, Huifeng Zhu, G. Cheng, Kit Thambiratnam, Behrooz Chitsaz, Dong Yu, F. Seide
{"title":"Context-dependent Deep Neural Networks for audio indexing of real-life data","authors":"Gang Li, Huifeng Zhu, G. Cheng, Kit Thambiratnam, Behrooz Chitsaz, Dong Yu, F. Seide","doi":"10.1109/SLT.2012.6424212","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424212","url":null,"abstract":"We apply Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to the real-life problem of audio indexing of data across various sources. Recently, we had shown that on the Switchboard benchmark on speaker-independent transcription of phone calls, CD-DNN-HMMs with 7 hidden layers reduce the word error rate by as much as one-third, compared to discriminatively trained Gaussian-mixture HMMs, and by one-fourth if the GMM-HMM also uses fMPE features. This paper takes CD-DNN-HMM based recognition into a real-life deployment for audio indexing. We find that for our best speaker-independent CD-DNN-HMM, with 32k senones trained on 2000h of data, the one-fourth reduction does carry over to inhomogeneous field data (video podcasts and talks). Compared to a speaker-adaptive GMM system, the relative improvement is 18%, at very similar end-to-end runtime. In system building, we find that DNNs can benefit from a larger number of senones than the GMM-HMM; and that DNN likelihood evaluation is a sizeable runtime factor even in our wide-beam context of generating rich lattices: Cutting the model size by 60% reduces runtime by one-third at a 5% relative WER loss.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132380891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
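For readers new to the hybrid setup: a CD-DNN-HMM replaces GMM state likelihoods with DNN senone posteriors, converted to scaled likelihoods for decoding by dividing out the senone priors. A minimal NumPy forward pass in that spirit is shown below; layer sizes and weights are whatever the caller supplies, and this is a sketch of the architecture family, not the paper's 7-layer, 32k-senone system.

```python
import numpy as np

def dnn_senone_posteriors(x, weights, biases):
    """Forward pass: sigmoid hidden layers, softmax output over senones
    (tied context-dependent HMM states). weights/biases: per-layer lists."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # sigmoid hidden layer
    z = weights[-1] @ h + biases[-1]
    e = np.exp(z - z.max())                       # numerically stable softmax
    return e / e.sum()

def scaled_likelihoods(posteriors, senone_priors):
    """HMM decoding needs p(x|s); since p(x|s) is proportional to
    p(s|x) / p(s), divide the posteriors by the senone priors."""
    return posteriors / senone_priors
```

The runtime observation in the abstract follows from this structure: the matrix products in the forward pass dominate, so shrinking the model directly cuts the likelihood-evaluation share of decoding time.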