2016 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

The MGB-2 challenge: Arabic multi-dialect broadcast media recognition
2016 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2016-09-19. DOI: 10.1109/SLT.2016.7846277
Ahmed M. Ali, P. Bell, James R. Glass, Yacine Messaoui, Hamdy Mubarak, S. Renals, Yifan Zhang
Abstract: This paper describes the Arabic Multi-Genre Broadcast (MGB-2) Challenge for SLT-2016. Unlike last year's English MGB Challenge, which focused on recognition of diverse TV genres, this year's challenge emphasises handling dialect diversity in Arabic speech. Audio data comes from 19 distinct programmes broadcast on the Aljazeera Arabic TV channel between March 2005 and December 2015. Programmes are split into three groups: conversations, interviews, and reports. A total of 1,200 hours have been released with lightly supervised transcriptions for acoustic modelling. For language modelling, over 110M words crawled from the Aljazeera Arabic website Aljazeera.net, covering the period 2000-2011, have been made available. Two lexicons have been provided, one phoneme-based and one grapheme-based. Finally, two tasks were proposed for this year's challenge: standard speech transcription and word alignment. This paper describes the task data and evaluation process used in the MGB challenge, and summarises the results obtained.
Citations: 93
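The transcription task is scored with word error rate. For reference, here is a minimal sketch of the standard WER computation via Levenshtein alignment; it is not the official MGB-2 scoring tooling, which also has to handle the lightly supervised reference segmentation.

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein alignment of reference and hypothesis."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i reference and first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])        # substitution or match
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # deletion, insertion
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("the cat sat", "the cat sat down"))  # one insertion over 3 words -> 0.333...
```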
Speech enhancement using Long Short-Term Memory based recurrent Neural Networks for noise robust Speaker Verification
2016 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2016-09-16. DOI: 10.1109/SLT.2016.7846281
Morten Kolbæk, Z. Tan, J. Jensen
Abstract: In this paper we propose to use a state-of-the-art Deep Recurrent Neural Network (DRNN) based Speech Enhancement (SE) algorithm for noise robust Speaker Verification (SV). Specifically, we study the performance of an i-vector based SV system when tested in noisy conditions using a DRNN based SE front-end utilizing a Long Short-Term Memory (LSTM) architecture. We make comparisons to systems using a Non-negative Matrix Factorization (NMF) based front-end and a Short-Time Spectral Amplitude Minimum Mean Square Error (STSA-MMSE) based front-end, respectively. We show in simulation experiments that a male-speaker, text-independent DRNN based SE front-end, without specific a priori knowledge about the noise type, outperforms both a text-, noise-type- and speaker-dependent NMF based front-end and a STSA-MMSE based front-end in terms of Equal Error Rates for a large range of noise types and signal-to-noise ratios on the RSR2015 speech corpus.
Citations: 53
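To make the pipeline concrete, here is a minimal sketch of an LSTM-based enhancement front-end of the kind described: a recurrent network predicts a time-frequency mask from noisy magnitude spectra, and the masked spectra feed the downstream speaker-verification features. The layer sizes, sigmoid masking, and MSE loss are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LSTMEnhancer(nn.Module):
    """Predicts a [0, 1] time-frequency mask from noisy magnitude spectra."""
    def __init__(self, n_freq=257, hidden=256, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=layers, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)

    def forward(self, noisy):                         # noisy: (batch, frames, n_freq)
        h, _ = self.lstm(noisy)
        return torch.sigmoid(self.mask(h)) * noisy    # masked (enhanced) spectra

# Training against clean spectra with an MSE objective (illustrative):
model = LSTMEnhancer()
noisy, clean = torch.rand(4, 100, 257), torch.rand(4, 100, 257)
loss = nn.functional.mse_loss(model(noisy), clean)
loss.backward()
```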
Approaches for language identification in mismatched environments
2016 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2016-09-08. DOI: 10.1109/SLT.2016.7846286
S. Nercessian, P. Torres-Carrasquillo, Gabriel Martinez-Montes
Abstract: In this paper, we consider the task of language identification in the context of mismatch conditions. Specifically, we address the issue of using unlabeled data in the domain of interest to improve the performance of a state-of-the-art system. The evaluation is performed on a 9-language set that includes data in both conversational telephone speech and narrowband broadcast speech. Multiple experiments are conducted to assess the performance of the system in this condition and to evaluate a number of alternatives for ameliorating the drop in performance. The best system evaluated is based on deep neural network (DNN) bottleneck features with i-vectors, utilizing a combination of all the approaches proposed in this work. The resulting system improved baseline DNN system performance by 30%.
Citations: 11
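A sketch of the bottleneck-feature idea underlying the best system: a DNN trained on a supervised objective exposes a narrow hidden layer whose activations become the inputs to the i-vector language-ID backend. All layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """DNN with a narrow hidden layer whose activations serve as features."""
    def __init__(self, n_in=80, hidden=1024, bottleneck=64, n_out=3000):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(n_in, hidden), nn.ReLU())
        self.bn = nn.Linear(hidden, bottleneck)        # the bottleneck layer
        self.post = nn.Sequential(nn.ReLU(), nn.Linear(bottleneck, hidden),
                                  nn.ReLU(), nn.Linear(hidden, n_out))

    def forward(self, x):                  # training path: predict supervised targets
        return self.post(self.bn(self.pre(x)))

    def features(self, x):                 # extraction path for the i-vector backend
        return self.bn(self.pre(x))

net = BottleneckDNN()
frames = torch.rand(32, 80)                # a batch of acoustic feature frames
bnf = net.features(frames)                 # (32, 64) bottleneck features
```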
Hierarchical attention model for improved machine comprehension of spoken content
2016 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2016-08-28. DOI: 10.1109/SLT.2016.7846270
Wei Fang, Juei-Yang Hsu, Hung-yi Lee, Lin-Shan Lee
Abstract: Multimedia or spoken content presents more attractive information than plain text content, but the former is more difficult to display on a screen and for a user to select. As a result, accessing large collections of spoken content is much more difficult and time-consuming for humans than accessing text. It is therefore highly attractive to develop machines which can automatically understand spoken content and summarize the key information for humans to browse. In this endeavor, a new task of machine comprehension of spoken content was proposed recently. The initial goal was defined as the listening comprehension test of TOEFL, a challenging academic English examination for English learners whose native languages are not English. An Attention-based Multi-hop Recurrent Neural Network (AMRNN) architecture was previously proposed for this task, which considered only the sequential relationship within the speech utterances. In this paper, we propose a new Hierarchical Attention Model (HAM), which constructs a multi-hop attention mechanism over tree-structured rather than sequential representations of the utterances. Improved comprehension performance, robust with respect to ASR errors, was obtained.
Citations: 13
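A minimal sketch of one attention hop over tree-structured representations, in the spirit of the HAM: the query attends over node vectors from a tree encoder rather than a flat word sequence, and the attended summary refines the query for the next hop. The shapes and the additive query update are assumptions for illustration.

```python
import torch

def attention_hop(query, nodes):
    """One hop: attend over tree-node vectors, then refine the query."""
    scores = nodes @ query                 # dot-product relevance, one score per node
    weights = torch.softmax(scores, dim=0)
    summary = weights @ nodes              # attention-weighted tree summary
    return query + summary                 # refined query for the next hop

query = torch.randn(128)                   # e.g. an encoded question
tree_nodes = torch.randn(20, 128)          # vectors for nodes of an utterance tree
for _ in range(3):                         # multi-hop refinement
    query = attention_hop(query, tree_nodes)
```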
Median-based generation of synthetic speech durations using a non-parametric approach
2016 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2016-08-22. DOI: 10.1109/SLT.2016.7846337
S. Ronanki, O. Watts, Simon King, G. Henter
Abstract: This paper proposes a new approach to duration modelling for statistical parametric speech synthesis, in which a recurrent statistical model is trained to output a phone transition probability at each timestep (acoustic frame). Unlike conventional approaches to duration modelling, which assume that duration distributions have a particular form (e.g., a Gaussian) and use the mean of that distribution for synthesis, our approach can in principle model any distribution supported on the non-negative integers. Generation from this model can be performed in many ways; here we consider output generation based on the median predicted duration. The median is more typical (more probable) than the conventional mean duration, is robust to training-data irregularities, and enables incremental generation. Furthermore, a frame-level approach to duration prediction is consistent with a longer-term goal of modelling durations and acoustic features together. Results indicate that the proposed method is competitive with baseline approaches in approximating the median duration of held-out natural speech.
Citations: 16
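The median selection described in the abstract can be computed directly from the per-frame transition probabilities: accumulate the implied duration CDF frame by frame and stop at the first frame where it reaches 0.5. A sketch under that reading follows; the variable names are ours, not the authors'.

```python
def median_duration(transition_probs):
    """transition_probs[t] = P(phone ends at frame t | it survived to frame t)."""
    survive, cdf = 1.0, 0.0
    for t, p in enumerate(transition_probs, start=1):
        cdf += survive * p            # mass of the phone ending exactly at frame t
        if cdf >= 0.5:
            return t                  # median duration in frames
        survive *= 1.0 - p
    return len(transition_probs)      # horizon reached before the CDF hit 0.5

print(median_duration([0.05, 0.1, 0.2, 0.4, 0.6]))  # -> 4
```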
Multi-lingual deep neural networks for language recognition
2016 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2016-08-08. DOI: 10.1109/SLT.2016.7846285
Luis Murphy Marcos, F. Richardson
Abstract: Multi-lingual feature extraction using bottleneck layers in deep neural networks (BN-DNNs) has proven to be an effective technique for low-resource speech recognition and, more recently, for language recognition. In this work we investigate the impact of the multi-lingual BN-DNN architecture and training configurations on language recognition performance for the NIST 2011 and 2015 language recognition evaluations (LRE11 and LRE15). The best performing multi-lingual BN-DNN configuration yields relative performance gains of 50% on LRE11 and 40% on LRE15 compared to a standard MFCC/SDC baseline system, and 17% on LRE11 and 7% on LRE15 relative to a single-language BN-DNN system. Detailed performance analysis using data from all 24 Babel languages, Fisher Spanish and Switchboard English shows the impact of language selection and the amount of training data on overall BN-DNN performance.
Citations: 4
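A sketch of the multi-lingual BN-DNN training layout: hidden and bottleneck layers are shared across languages, with one output layer per training language, so every language shapes the shared bottleneck representation. The two-language setup and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiLingualBNDNN(nn.Module):
    """Shared trunk and bottleneck; one softmax head per training language."""
    def __init__(self, n_in=80, hidden=1024, bottleneck=64, targets=(3000, 2500)):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_in, hidden), nn.ReLU(),
                                    nn.Linear(hidden, bottleneck), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(bottleneck, n) for n in targets)

    def forward(self, x, lang):
        return self.heads[lang](self.shared(x))   # language-specific target logits

net = MultiLingualBNDNN()
logits = net(torch.rand(16, 80), lang=0)           # a batch from language 0
```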
Sequence training and adaptation of highway deep neural networks
2016 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2016-07-07. DOI: 10.1109/SLT.2016.7846304
Liang Lu
Abstract: The highway deep neural network (HDNN) is a type of depth-gated feedforward neural network, which has been shown to be easier to train with more hidden layers and to generalise better than conventional plain deep neural networks (DNNs). Previously, we investigated a structured HDNN architecture for speech recognition, in which the two gate functions were tied across all the hidden layers, and we were able to train a much smaller model without sacrificing recognition accuracy. In this paper, we continue the study of this architecture with a sequence-discriminative training criterion and speaker adaptation techniques on the AMI meeting speech recognition corpus. We show that these two techniques improve speech recognition accuracy on top of the model trained with the cross-entropy criterion. Furthermore, we demonstrate that the two gate functions tied across all the hidden layers are able to control the information flow over the whole network, and that we can achieve considerable improvements by updating only these gate functions in both sequence training and adaptation experiments.
Citations: 6
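A sketch of the tied-gate structure the paper builds on: per-layer transforms, but a single transform gate and a single carry gate shared by all hidden layers. The sizes are illustrative; the point is that adaptation can then update only the two gate functions.

```python
import torch
import torch.nn as nn

class TiedGateHDNN(nn.Module):
    """Highway layers with the transform and carry gates tied across depth."""
    def __init__(self, dim=512, n_layers=6):
        super().__init__()
        self.transforms = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.T = nn.Linear(dim, dim)      # transform gate, shared by all layers
        self.C = nn.Linear(dim, dim)      # carry gate, shared by all layers

    def forward(self, x):
        for transform in self.transforms:
            h = torch.relu(transform(x))
            x = torch.sigmoid(self.T(x)) * h + torch.sigmoid(self.C(x)) * x
        return x

net = TiedGateHDNN()
gate_params = list(net.T.parameters()) + list(net.C.parameters())
# Adaptation in the spirit of the paper would optimize only gate_params.
```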
DialPort: Connecting the spoken dialog research community to real user data
2016 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2016-06-08. DOI: 10.1109/SLT.2016.7846249
Tiancheng Zhao, Kyusong Lee, M. Eskénazi
Abstract: This paper describes a new spoken dialog portal that connects systems produced by the spoken dialog academic research community and gives them access to real users. We introduce a distributed, multi-modal, multi-agent prototype dialog framework that affords easy integration with various remote resources, ranging from end-to-end dialog systems to external knowledge APIs. The portal provides seamless passage from one spoken dialog system to another. To date, the DialPort portal has successfully connected to the multi-domain spoken dialog system at Cambridge University, the NOAA (National Oceanic and Atmospheric Administration) weather API and the Yelp API. We present statistics derived from log data gathered during preliminary tests of the portal, on the performance of the portal and on the quality (seamlessness) of transitions from one system to another.
Citations: 20
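For intuition, a toy sketch of the routing step a portal like this must perform: a master agent inspects each user turn and hands the session to whichever registered system claims the domain. The keyword matching below is a stand-in, not DialPort's actual routing logic.

```python
class DialogPortal:
    """Routes each user turn to whichever registered agent claims the domain."""
    def __init__(self):
        self.agents = {}                       # domain -> (keywords, handler)

    def register(self, domain, keywords, handler):
        self.agents[domain] = (set(keywords), handler)

    def route(self, turn):
        words = set(turn.lower().split())
        for domain, (keys, handler) in self.agents.items():
            if words & keys:                   # toy domain detection
                return handler(turn)           # hand the turn to that agent
        return "Which service would you like: weather, restaurants, ...?"

portal = DialogPortal()
portal.register("weather", {"weather", "rain", "forecast"},
                lambda t: "(weather agent reply, e.g. via the NOAA API)")
portal.register("restaurants", {"restaurant", "food", "eat"},
                lambda t: "(restaurant agent reply, e.g. via the Yelp API)")
print(portal.route("will it rain tomorrow"))
```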
Deep neural network driven mixture of PLDA for robust i-vector speaker verification
2016 IEEE Spoken Language Technology Workshop (SLT). DOI: 10.1109/SLT.2016.7846263
N. Li, M. Mak, Jen-Tzung Chien
Abstract: In speaker recognition, the mismatch between enrollment and test utterances due to noise at different signal-to-noise ratios (SNRs) is a great challenge. Based on the observation that noise-level variability causes the i-vectors to form heterogeneous clusters, this paper proposes using an SNR-aware deep neural network (DNN) to guide the training of PLDA mixture models. Specifically, given an i-vector, the SNR posterior probabilities produced by the DNN are used as the posteriors of the indicator variables of the mixture model. As a result, the proposed model provides a more reasonable soft division of the i-vector space compared to the conventional mixture of PLDA. During verification, given a test trial, the marginal likelihoods from the individual PLDA models are linearly combined, weighted by the posterior probabilities of the SNR levels computed by the DNN. Experimental results for SNR-mismatch tasks based on NIST 2012 SRE suggest that the proposed model is more effective than PLDA and the conventional mixture of PLDA for handling heterogeneous corpora.
Citations: 9
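The verification-time combination described in the abstract is a posterior-weighted sum of per-SNR-level PLDA scores. A sketch under that reading, where ToyPLDA and its likelihood_ratio method are a hypothetical stand-in for trained PLDA components:

```python
import numpy as np

class ToyPLDA:
    """Toy stand-in for a trained PLDA model; not a real PLDA likelihood."""
    def __init__(self, shift):
        self.shift = shift
    def likelihood_ratio(self, enroll, test):
        return float(enroll @ test) + self.shift

def mixture_plda_score(enroll, test, plda_models, snr_posteriors):
    """Posterior-weighted combination of per-SNR-level PLDA scores."""
    scores = np.array([m.likelihood_ratio(enroll, test) for m in plda_models])
    return float(np.dot(snr_posteriors, scores))

enroll, test = np.random.randn(400), np.random.randn(400)   # i-vectors
models = [ToyPLDA(0.0), ToyPLDA(0.5), ToyPLDA(1.0)]         # one per SNR level
posteriors = np.array([0.1, 0.7, 0.2])                      # from the SNR DNN
print(mixture_plda_score(enroll, test, models, posteriors))
```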
The fifth dialog state tracking challenge
2016 IEEE Spoken Language Technology Workshop (SLT). DOI: 10.1109/SLT.2016.7846311
Seokhwan Kim, L. F. D’Haro, Rafael E. Banchs, J. Williams, Matthew Henderson, Koichiro Yoshino
Abstract: Dialog state tracking, the process of updating the dialog state after each interaction with the user, is a key component of most dialog systems. Following a similar scheme to the fourth dialog state tracking challenge, this edition again focused on human-human dialogs, but introduced the task of cross-lingual adaptation of trackers. The challenge received a total of 32 entries from 9 research groups. In addition, several pilot track evaluations were also proposed, receiving a total of 16 entries from 4 groups. In both cases, the results show that most of the groups were able to outperform the provided baselines for each task.
Citations: 69
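For orientation, a toy rule-based tracker skeleton showing what is updated at each turn: a slot-value state derived from an ontology and the user's utterance. It mirrors the kind of simple baseline such challenges provide, not any participant's system; the ontology below is invented for illustration.

```python
def update_state(state, utterance, ontology):
    """state: dict slot -> value; ontology: dict slot -> set of candidate values."""
    words = utterance.lower()
    for slot, values in ontology.items():
        for v in values:
            if v.lower() in words:
                state[slot] = v          # most recent mention wins
    return state

ontology = {"area": {"Chinatown", "Orchard"}, "cuisine": {"seafood", "laksa"}}
state = {}
state = update_state(state, "I want seafood near Chinatown", ontology)
print(state)   # {'area': 'Chinatown', 'cuisine': 'seafood'}
```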