2016 IEEE Spoken Language Technology Workshop (SLT)最新文献

筛选
英文 中文
BBN technologies' OpenSAD system BBN technologies的OpenSAD系统
2016 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846238
Scott Novotney, D. Karakos, J. Silovský, R. Schwartz
{"title":"BBN technologies' OpenSAD system","authors":"Scott Novotney, D. Karakos, J. Silovský, R. Schwartz","doi":"10.1109/SLT.2016.7846238","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846238","url":null,"abstract":"We describe our submission to the NIST OpenSAD evaluation of speech activity detection of noisy audio generated by the DARPA RATS program. With frequent transmission degradation, channel interference and other noises added, simple energy thresholds do a poor job at SAD for this audio. The evaluation measured performance on both in-training and novel channels. Our approach used a system combination of feed-forward neural networks and bidirectional LSTM recurrent neural networks. System combination and unsupervised adaptation provided further gains on novel channels that lack training data. These improvements lead to a 26% relative improvement for novel channels over simple decoding. Our system resulted in the lowest error rate on the in-training channels and second on the out-of-training channels.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"21 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125855402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Automated structure discovery and parameter tuning of neural network language model based on evolution strategy 基于进化策略的神经网络语言模型自动结构发现与参数整定
2016 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846334
Tomohiro Tanaka, Takafumi Moriya, T. Shinozaki, Shinji Watanabe, Takaaki Hori, Kevin Duh
{"title":"Automated structure discovery and parameter tuning of neural network language model based on evolution strategy","authors":"Tomohiro Tanaka, Takafumi Moriya, T. Shinozaki, Shinji Watanabe, Takaaki Hori, Kevin Duh","doi":"10.1109/SLT.2016.7846334","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846334","url":null,"abstract":"Long short-term memory (LSTM) recurrent neural network based language models are known to improve speech recognition performance. However, significant effort is required to optimize network structures and training configurations. In this study, we automate the development process using evolutionary algorithms. In particular, we apply the covariance matrix adaptation-evolution strategy (CMA-ES), which has demonstrated robustness in other black box hyper-parameter optimization problems. By flexibly allowing optimization of various meta-parameters including layer wise unit types, our method automatically finds a configuration that gives improved recognition performance. Further, by using a Pareto based multi-objective CMA-ES, both WER and computational time were reduced jointly: after 10 generations, relative WER and computational time reductions for decoding were 4.1% and 22.7% respectively, compared to an initial baseline system whose WER was 8.7%.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114425383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
Automated optimization of decoder hyper-parameters for online LVCSR 在线LVCSR解码器超参数的自动优化
2016 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846303
Akshay Chandrashekaran, Ian Lane
{"title":"Automated optimization of decoder hyper-parameters for online LVCSR","authors":"Akshay Chandrashekaran, Ian Lane","doi":"10.1109/SLT.2016.7846303","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846303","url":null,"abstract":"In this paper, we explore the usage of automated hyper-parameter optimization techniques with scalarization of multiple objectives to find decoder hyper-parameters suitable for a given acoustic and language model for an LVCSR task. We compare manual optimization, random sampling, tree of Parzen estimators, Bayesian Optimization, and genetic algorithm to find a technique that yields better performance than manual optimization in a comparable number of hyper-parameter evaluations. We focus on a scalar combination of word error rate (WER), log of real time factor (logRTF), and peak memory usage, formulated using the augmented Tchebyscheff function(ATF), as the objective function for the automated techniques. For this task, with a constraint on the maximum number of objective evaluations, we find that the best automated optimization technique: Bayesian Optimization outperforms manual optimization by 8% in terms of ATF. We find that memory usage was not a very useful distinguishing factor between different hyper-parameter settings, with trade-offs occurring between RTF and WER a majority of the time. We also try to perform optimization of WER with a hard constraint on the real time factor of 0.1. In this case, performing constrained Bayesian Optimization yields a model that provides an improvement of 2.7% over the best model obtained from manual optimization with 60% the number of evaluations.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125964616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
Parallel Long Short-Term Memory for multi-stream classification 多流分类的并行长短期记忆
2016 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846268
Mohamed Bouaziz, Mohamed Morchid, Richard Dufour, G. Linarès, R. Mori
{"title":"Parallel Long Short-Term Memory for multi-stream classification","authors":"Mohamed Bouaziz, Mohamed Morchid, Richard Dufour, G. Linarès, R. Mori","doi":"10.1109/SLT.2016.7846268","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846268","url":null,"abstract":"Recently, machine learning methods have provided a broad spectrum of original and efficient algorithms based on Deep Neural Networks (DNN) to automatically predict an outcome with respect to a sequence of inputs. Recurrent hidden cells allow these DNN-based models to manage long-term dependencies such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM). Nevertheless, these RNNs process a single input stream in one (LSTM) or two (Bidirectional LSTM) directions. But most of the information available nowadays is from multistreams or multimedia documents, and require RNNs to process these information synchronously during the training. This paper presents an original LSTM-based architecture, named Parallel LSTM (PLSTM), that carries out multiple parallel synchronized input sequences in order to predict a common output. The proposed PLSTM method could be used for parallel sequence classification purposes. The PLSTM approach is evaluated on an automatic telecast genre sequences classification task and compared with different state-of-the-art architectures. Results show that the proposed PLSTM method outperforms the baseline n-gram models as well as the state-of-the-art LSTM approach.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126700902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
Influence of corpus size and content on the perceptual quality of a unit selection MaryTTS voice 语料库大小和内容对单元选择MaryTTS语音感知质量的影响
2016 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846336
Florian Hinterleitner, Benjamin Weiss, S. Möller
{"title":"Influence of corpus size and content on the perceptual quality of a unit selection MaryTTS voice","authors":"Florian Hinterleitner, Benjamin Weiss, S. Möller","doi":"10.1109/SLT.2016.7846336","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846336","url":null,"abstract":"State-of-the-art approaches on text-to-speech (TTS) synthesis like unit selection and HMM synthesis are data-driven. Therefore, they use a prerecorded speech corpus of natural speech to build a voice. This paper investigates the influence of the size of the speech corpus on five different perceptual quality dimensions. Six German unit selection voices were created based on subsets of different sizes of the same speech corpus using the MaryTTS synthesis platform. Statistical analysis showed a significant influence of the size of the speech corpus on all of the five dimensions. Surprisingly the voice created from the second largest speech corpus reached the best ratings in almost all dimensions, with the rating in the dimension fluency and intelligibility being significantly higher than the ratings of any other voice. Moreover, we could also verify a significant effect of the synthesized utterance on four of the five perceptual quality dimensions.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"11 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114039200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Automatic plagiarism detection for spoken responses in an assessment of English language proficiency 英语语言能力评估中口语回答的自动抄袭检测
2016 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846254
Xinhao Wang, Keelan Evanini, James V. Bruno, Matthew David Mulholland
{"title":"Automatic plagiarism detection for spoken responses in an assessment of English language proficiency","authors":"Xinhao Wang, Keelan Evanini, James V. Bruno, Matthew David Mulholland","doi":"10.1109/SLT.2016.7846254","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846254","url":null,"abstract":"This paper addresses the task of automatically detecting plagiarized responses in the context of a test of spoken English proficiency for non-native speakers. Text-to-text content similarity features are used jointly with speaking proficiency features extracted using an automated speech scoring system to train classifiers to distinguish between plagiarized and non-plagiarized spoken responses. A large data set drawn from an operational English proficiency assessment is used to simulate the performance of the detection system in a practical application. The best classifier on this heavily imbalanced data set resulted in an F1-score of 0.706 on the plagiarized class. These results indicate that the proposed system can potentially be used to improve the validity of both human and automated assessment of non-native spoken English.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129321592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
Robust utterance classification using multiple classifiers in the presence of speech recognition errors 基于多分类器的语音识别错误鲁棒性语音分类
2016 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846291
Takeshi Homma, Kazuaki Shima, Takuya Matsumoto
{"title":"Robust utterance classification using multiple classifiers in the presence of speech recognition errors","authors":"Takeshi Homma, Kazuaki Shima, Takuya Matsumoto","doi":"10.1109/SLT.2016.7846291","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846291","url":null,"abstract":"In order to achieve an utterance classifier that not only works robustly against speech recognition errors but also maintains high accuracy for input with no errors, we propose the following techniques. First, we propose a classifier training method in which not only error-free transcriptions but also recognized sentences with errors were used as training data. To maintain high accuracy whether or not input has recognition errors, we adjusted a scaling factor of the number of transcriptions for training data. Second, we introduced three classifiers that utilize different input features: words, phonemes, and words recovered from phonetic recognition errors. We also introduced a selection method that selects the most probable utterance class from outputs of multiple utterance classifiers using recognition results obtained from enhanced and non-enhanced speech signals. Experimental results showed our method cuts 55% of classification errors for speech recognition input while accuracy degradation rate for transcription input is 0.7%.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129136619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Automatic turn segmentation for Movie & TV subtitles 自动转向分割电影和电视字幕
2016 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846272
Pierre Lison, R. Meena
{"title":"Automatic turn segmentation for Movie & TV subtitles","authors":"Pierre Lison, R. Meena","doi":"10.1109/SLT.2016.7846272","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846272","url":null,"abstract":"Movie and TV subtitles contain large amounts of conversational material, but lack an explicit turn structure. This paper present a data-driven approach to the segmentation of subtitles into dialogue turns. Training data is first extracted by aligning subtitles with transcripts in order to obtain speaker labels. This data is then used to build a classifier whose task is to determine whether two consecutive sentences are part of the same dialogue turn. The approach relies on linguistic, visual and timing features extracted from the subtitles themselves and does not require access to the audiovisual material - although speaker diarization can be exploited when audio data is available. The approach also exploits alignments with related subtitles in other languages to further improve the classification performance. The classifier achieves an accuracy of 78 % on a held-out test set. A follow-up annotation experiment demonstrates that this task is also difficult for human annotators.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117240134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 23
End-to-End attention based text-dependent speaker verification 基于端到端注意的文本依赖说话人验证
2016 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846261
Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li, Y. Gong
{"title":"End-to-End attention based text-dependent speaker verification","authors":"Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li, Y. Gong","doi":"10.1109/SLT.2016.7846261","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846261","url":null,"abstract":"A new type of End-to-End system for text-dependent speaker verification is presented in this paper. Previously, using the phonetic discriminate/speaker discriminate DNN as a feature extractor for speaker verification has shown promising results. The extracted frame-level (bottleneck, posterior or d-vector) features are equally weighted and aggregated to compute an utterance-level speaker representation (d-vector or i-vector). In this work we use a speaker discriminate CNN to extract the noise-robust frame-level features. These features are smartly combined to form an utterance-level speaker vector through an attention mechanism. The proposed attention model takes the speaker discriminate information and the phonetic information to learn the weights. The whole system, including the CNN and attention model, is joint optimized using an end-to-end criterion. The training algorithm imitates exactly the evaluation process — directly mapping a test utterance and a few target speaker utterances into a single verification score. The algorithm can smartly select the most similar impostor for each target speaker to train the network. We demonstrated the effectiveness of the proposed end-to-end system on Windows 10 “Hey Cortana” speaker verification task.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114651716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 173
Attribute based shared hidden layers for cross-language knowledge transfer 基于属性的跨语言知识传递共享隐藏层
2016 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846327
Vipul Arora, A. Lahiri, Henning Reetz
{"title":"Attribute based shared hidden layers for cross-language knowledge transfer","authors":"Vipul Arora, A. Lahiri, Henning Reetz","doi":"10.1109/SLT.2016.7846327","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846327","url":null,"abstract":"Deep neural network (DNN) acoustic models can be adapted to under-resourced languages by transferring the hidden layers. An analogous transfer problem is popular as few-shot learning to recognise scantily seen objects based on their meaningful attributes. In similar way, this paper proposes a principled way to represent the hidden layers of DNN in terms of attributes shared across languages. The diverse phoneme sets of different languages can be represented in terms of phonological features that are shared by them. The DNN layers estimating these features could then be transferred in a meaningful and reliable way. Here, we evaluate model transfer from English to German, by comparing the proposed method with other popular methods on the task of phoneme recognition. Experimental results support that apart from providing interpretability to the DNN acoustic models, the proposed framework provides efficient means for their speedy adaptation to different languages, even in the face of scanty adaptation data.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125596040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信