2012 IEEE Spoken Language Technology Workshop (SLT)最新文献_第7页

On the generalization of Shannon entropy for speech recognition 香农熵在语音识别中的推广

2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424204

Nicolas Obin, M. Liuni

引用次数: 17

Speech-based emotion classification using multiclass SVM with hybrid kernel and thresholding fusion 基于混合核和阈值融合的多类支持向量机语音情感分类

2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424267

Na Yang, R. Muraleedharan, J. Kohl, I. Demirkol, W. Heinzelman, Melissa L. Sturge‐Apple

引用次数: 54

Policy optimisation of POMDP-based dialogue systems without state space compression 无状态空间压缩的基于pomdp的对话系统策略优化

2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424165

Milica Gasic, Matthew Henderson, Blaise Thomson, P. Tsiakoulis, S. Young

引用次数: 21

Personalized language modeling by crowd sourcing with social network data for voice access of cloud applications 个性化的语言建模，通过群体外包与社会网络数据的云应用程序的语音访问

2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424220

Tsung-Hsien Wen, Hung-yi Lee, Tai-Yuan Chen, Lin-Shan Lee

引用次数: 11

Lexical entrainment and success in student engineering groups 学生工程小组的词汇学习与成功

2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424258

Heather Friedberg, D. Litman, Susannah B. F. Paletz

引用次数: 57

Automatic transcription of academic lectures from diverse disciplines 自动转录来自不同学科的学术讲座

2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424257

Ghada Alharbi, Thomas Hain

{"title":"Automatic transcription of academic lectures from diverse disciplines","authors":"Ghada Alharbi, Thomas Hain","doi":"10.1109/SLT.2012.6424257","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424257","url":null,"abstract":"In a multimedia world it is now common to record professional presentations, on video or with audio only. Such recordings include talks and academic lectures, which are becoming a valuable resource for students and professionals alike. However, organising such material from a diverse set of disciplines seems to be not an easy task. One way to address this problem is to build an Automatic Speech Recognition (ASR) system in order to use its output for analysing such materials. In this work ASR results for lectures from diverse sources are presented. The work is based on a new collection of data, obtained by the Liberated Learning Consortium (LLC). The study's primary goals are two-fold: first to show variability across disciplines from an ASR perspective, and how to choose sources for the construction of language models (LMs); second, to provide an analysis of the lecture transcription for automatic determination of structures in lecture discourse. In particular, we investigate whether there are properties common to lectures from different disciplines. This study focuses on textual features. Lectures are multimodal experiences - it is not clear whether textual features alone are sufficient for the recognition of such common elements, or other features, e.g. acoustic features such as the speaking rate, are needed. The results show that such common properties are retained across disciplines even on ASR output with a Word Error Rate (WER) of 30%.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126434240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Combining cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition 结合倒谱归一化与人工耳蜗类语音处理的麦克风阵列语音识别

2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424211

Cong-Thanh Do, M. Taghizadeh, Philip N. Garner

{"title":"Combining cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition","authors":"Cong-Thanh Do, M. Taghizadeh, Philip N. Garner","doi":"10.1109/SLT.2012.6424211","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424211","url":null,"abstract":"This paper investigates the combination of cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition. Testing speech signals are recorded by a circular microphone array and are subsequently processed with superdirective beamforming and McCowan post-filtering. Training speech signals, from the multichannel overlapping Number corpus (MONC), are clean and not overlapping. Cochlear implant-like speech processing, which is inspired from the speech processing strategy in cochlear implants, is applied on the training and testing speech signals. Cepstral normalization, including cepstral mean and variance normalization (CMN and CVN), are applied on the training and testing cepstra. Experiments show that implementing either cepstral normalization or cochlear implant-like speech processing helps in reducing the WERs of microphone array-based speech recognition. Combining cepstral normalization and cochlear implant-like speech processing reduces further the WERs, when there is overlapping speech. Train/test mismatches are measured using the Kullback-Leibler divergence (KLD), between the global probability density functions (PDFs) of training and testing cepstral vectors. This measure reveals a train/test mismatch reduction when either cepstral normalization or cochlear implant-like speech processing is used. It reveals also that combining these two processing reduces further the train/test mismatches as well as the WERs.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116829370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 7

A noise-robust speech recognition method composed of weak noise suppression and weak Vector Taylor Series Adaptation 一种由弱噪声抑制和弱矢量泰勒级数自适应组成的噪声鲁棒语音识别方法

2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424205

Shuji Komeiji, T. Arakawa, Takafumi Koshinaka

引用次数: 1

Automatic detection and correction of syntax-based prosody annotation errors 基于句法的韵律标注错误自动检测与修正

2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424259

Sandrine Brognaux, Thomas Drugman, Richard Beaufort

引用次数: 2

Context-dependent Deep Neural Networks for audio indexing of real-life data 上下文相关的深度神经网络音频索引的现实生活数据

2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424212

Gang Li, Huifeng Zhu, G. Cheng, Kit Thambiratnam, Behrooz Chitsaz, Dong Yu, F. Seide

{"title":"Context-dependent Deep Neural Networks for audio indexing of real-life data","authors":"Gang Li, Huifeng Zhu, G. Cheng, Kit Thambiratnam, Behrooz Chitsaz, Dong Yu, F. Seide","doi":"10.1109/SLT.2012.6424212","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424212","url":null,"abstract":"We apply Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to the real-life problem of audio indexing of data across various sources. Recently, we had shown that on the Switchboard benchmark on speaker-independent transcription of phone calls, CD-DNN-HMMs with 7 hidden layers reduce the word error rate by as much as one-third, compared to discriminatively trained Gaussian-mixture HMMs, and by one-fourth if the GMM-HMM also uses fMPE features. This paper takes CD-DNN-HMM based recognition into a real-life deployment for audio indexing. We find that for our best speaker-independent CD-DNN-HMM, with 32k senones trained on 2000h of data, the one-fourth reduction does carry over to inhomogeneous field data (video podcasts and talks). Compared to a speaker-adaptive GMM system, the relative improvement is 18%, at very similar end-to-end runtime. In system building, we find that DNNs can benefit from a larger number of senones than the GMM-HMM; and that DNN likelihood evaluation is a sizeable runtime factor even in our wide-beam context of generating rich lattices: Cutting the model size by 60% reduces runtime by one-third at a 5% relative WER loss.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132380891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 10