BBN Technologies' OpenSAD system
Scott Novotney, D. Karakos, J. Silovský, R. Schwartz
In: 2016 IEEE Spoken Language Technology Workshop (SLT), December 2016. DOI: 10.1109/SLT.2016.7846238

Abstract: We describe our submission to the NIST OpenSAD evaluation of speech activity detection (SAD) on noisy audio generated by the DARPA RATS program. With frequent transmission degradation, channel interference and other added noises, simple energy thresholds do a poor job at SAD for this audio. The evaluation measured performance on both in-training and novel channels. Our approach used a system combination of feed-forward neural networks and bidirectional LSTM recurrent neural networks. System combination and unsupervised adaptation provided further gains on novel channels that lack training data. These improvements led to a 26% relative improvement for novel channels over simple decoding. Our system achieved the lowest error rate on the in-training channels and the second lowest on the out-of-training channels.
Automated structure discovery and parameter tuning of neural network language model based on evolution strategy
Tomohiro Tanaka, Takafumi Moriya, T. Shinozaki, Shinji Watanabe, Takaaki Hori, Kevin Duh
In: 2016 IEEE Spoken Language Technology Workshop (SLT), December 2016. DOI: 10.1109/SLT.2016.7846334

Abstract: Long short-term memory (LSTM) recurrent neural network based language models are known to improve speech recognition performance. However, significant effort is required to optimize network structures and training configurations. In this study, we automate the development process using evolutionary algorithms. In particular, we apply the covariance matrix adaptation evolution strategy (CMA-ES), which has demonstrated robustness in other black-box hyper-parameter optimization problems. By flexibly allowing optimization of various meta-parameters, including layer-wise unit types, our method automatically finds a configuration that gives improved recognition performance. Further, by using a Pareto-based multi-objective CMA-ES, both WER and computational time were reduced jointly: after 10 generations, relative WER and decoding-time reductions were 4.1% and 22.7% respectively, compared to an initial baseline system whose WER was 8.7%.
{"title":"Automated optimization of decoder hyper-parameters for online LVCSR","authors":"Akshay Chandrashekaran, Ian Lane","doi":"10.1109/SLT.2016.7846303","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846303","url":null,"abstract":"In this paper, we explore the usage of automated hyper-parameter optimization techniques with scalarization of multiple objectives to find decoder hyper-parameters suitable for a given acoustic and language model for an LVCSR task. We compare manual optimization, random sampling, tree of Parzen estimators, Bayesian Optimization, and genetic algorithm to find a technique that yields better performance than manual optimization in a comparable number of hyper-parameter evaluations. We focus on a scalar combination of word error rate (WER), log of real time factor (logRTF), and peak memory usage, formulated using the augmented Tchebyscheff function(ATF), as the objective function for the automated techniques. For this task, with a constraint on the maximum number of objective evaluations, we find that the best automated optimization technique: Bayesian Optimization outperforms manual optimization by 8% in terms of ATF. We find that memory usage was not a very useful distinguishing factor between different hyper-parameter settings, with trade-offs occurring between RTF and WER a majority of the time. We also try to perform optimization of WER with a hard constraint on the real time factor of 0.1. In this case, performing constrained Bayesian Optimization yields a model that provides an improvement of 2.7% over the best model obtained from manual optimization with 60% the number of evaluations.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125964616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Long Short-Term Memory for multi-stream classification
Mohamed Bouaziz, Mohamed Morchid, Richard Dufour, G. Linarès, R. Mori
In: 2016 IEEE Spoken Language Technology Workshop (SLT), December 2016. DOI: 10.1109/SLT.2016.7846268

Abstract: Recently, machine learning methods have provided a broad spectrum of original and efficient algorithms based on Deep Neural Networks (DNN) to automatically predict an outcome from a sequence of inputs. Architectures with recurrent hidden cells, such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks, allow these models to manage long-term dependencies. Nevertheless, these RNNs process a single input stream in one (LSTM) or two (Bidirectional LSTM) directions, while much of the information available nowadays comes from multi-stream or multimedia documents and requires RNNs to process the streams synchronously during training. This paper presents an original LSTM-based architecture, named Parallel LSTM (PLSTM), that processes multiple parallel synchronized input sequences in order to predict a common output, and which can be used for parallel sequence classification. The PLSTM approach is evaluated on an automatic telecast genre sequence classification task and compared with different state-of-the-art architectures. Results show that the proposed PLSTM method outperforms both the baseline n-gram models and the state-of-the-art LSTM approach.
{"title":"Influence of corpus size and content on the perceptual quality of a unit selection MaryTTS voice","authors":"Florian Hinterleitner, Benjamin Weiss, S. Möller","doi":"10.1109/SLT.2016.7846336","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846336","url":null,"abstract":"State-of-the-art approaches on text-to-speech (TTS) synthesis like unit selection and HMM synthesis are data-driven. Therefore, they use a prerecorded speech corpus of natural speech to build a voice. This paper investigates the influence of the size of the speech corpus on five different perceptual quality dimensions. Six German unit selection voices were created based on subsets of different sizes of the same speech corpus using the MaryTTS synthesis platform. Statistical analysis showed a significant influence of the size of the speech corpus on all of the five dimensions. Surprisingly the voice created from the second largest speech corpus reached the best ratings in almost all dimensions, with the rating in the dimension fluency and intelligibility being significantly higher than the ratings of any other voice. Moreover, we could also verify a significant effect of the synthesized utterance on four of the five perceptual quality dimensions.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"11 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114039200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic plagiarism detection for spoken responses in an assessment of English language proficiency
Xinhao Wang, Keelan Evanini, James V. Bruno, Matthew David Mulholland
In: 2016 IEEE Spoken Language Technology Workshop (SLT), December 2016. DOI: 10.1109/SLT.2016.7846254

Abstract: This paper addresses the task of automatically detecting plagiarized responses in the context of a test of spoken English proficiency for non-native speakers. Text-to-text content similarity features are used jointly with speaking proficiency features extracted using an automated speech scoring system to train classifiers to distinguish between plagiarized and non-plagiarized spoken responses. A large data set drawn from an operational English proficiency assessment is used to simulate the performance of the detection system in a practical application. The best classifier on this heavily imbalanced data set resulted in an F1-score of 0.706 on the plagiarized class. These results indicate that the proposed system can potentially be used to improve the validity of both human and automated assessment of non-native spoken English.
{"title":"Robust utterance classification using multiple classifiers in the presence of speech recognition errors","authors":"Takeshi Homma, Kazuaki Shima, Takuya Matsumoto","doi":"10.1109/SLT.2016.7846291","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846291","url":null,"abstract":"In order to achieve an utterance classifier that not only works robustly against speech recognition errors but also maintains high accuracy for input with no errors, we propose the following techniques. First, we propose a classifier training method in which not only error-free transcriptions but also recognized sentences with errors were used as training data. To maintain high accuracy whether or not input has recognition errors, we adjusted a scaling factor of the number of transcriptions for training data. Second, we introduced three classifiers that utilize different input features: words, phonemes, and words recovered from phonetic recognition errors. We also introduced a selection method that selects the most probable utterance class from outputs of multiple utterance classifiers using recognition results obtained from enhanced and non-enhanced speech signals. Experimental results showed our method cuts 55% of classification errors for speech recognition input while accuracy degradation rate for transcription input is 0.7%.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129136619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic turn segmentation for Movie & TV subtitles","authors":"Pierre Lison, R. Meena","doi":"10.1109/SLT.2016.7846272","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846272","url":null,"abstract":"Movie and TV subtitles contain large amounts of conversational material, but lack an explicit turn structure. This paper present a data-driven approach to the segmentation of subtitles into dialogue turns. Training data is first extracted by aligning subtitles with transcripts in order to obtain speaker labels. This data is then used to build a classifier whose task is to determine whether two consecutive sentences are part of the same dialogue turn. The approach relies on linguistic, visual and timing features extracted from the subtitles themselves and does not require access to the audiovisual material - although speaker diarization can be exploited when audio data is available. The approach also exploits alignments with related subtitles in other languages to further improve the classification performance. The classifier achieves an accuracy of 78 % on a held-out test set. A follow-up annotation experiment demonstrates that this task is also difficult for human annotators.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117240134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
End-to-end attention-based text-dependent speaker verification
Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li, Y. Gong
In: 2016 IEEE Spoken Language Technology Workshop (SLT), December 2016. DOI: 10.1109/SLT.2016.7846261

Abstract: A new type of end-to-end system for text-dependent speaker verification is presented in this paper. Previously, using phonetically discriminative or speaker-discriminative DNNs as feature extractors for speaker verification has shown promising results, where the extracted frame-level features (bottleneck, posterior or d-vector) are equally weighted and aggregated to compute an utterance-level speaker representation (d-vector or i-vector). In this work we use a speaker-discriminative CNN to extract noise-robust frame-level features, which are combined into an utterance-level speaker vector through an attention mechanism. The proposed attention model uses both speaker-discriminative and phonetic information to learn the weights. The whole system, including the CNN and the attention model, is jointly optimized using an end-to-end criterion. The training algorithm exactly imitates the evaluation process, directly mapping a test utterance and a few target-speaker utterances to a single verification score, and it selects the most similar impostor for each target speaker to train the network. We demonstrate the effectiveness of the proposed end-to-end system on the Windows 10 "Hey Cortana" speaker verification task.
{"title":"Attribute based shared hidden layers for cross-language knowledge transfer","authors":"Vipul Arora, A. Lahiri, Henning Reetz","doi":"10.1109/SLT.2016.7846327","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846327","url":null,"abstract":"Deep neural network (DNN) acoustic models can be adapted to under-resourced languages by transferring the hidden layers. An analogous transfer problem is popular as few-shot learning to recognise scantily seen objects based on their meaningful attributes. In similar way, this paper proposes a principled way to represent the hidden layers of DNN in terms of attributes shared across languages. The diverse phoneme sets of different languages can be represented in terms of phonological features that are shared by them. The DNN layers estimating these features could then be transferred in a meaningful and reliable way. Here, we evaluate model transfer from English to German, by comparing the proposed method with other popular methods on the task of phoneme recognition. Experimental results support that apart from providing interpretability to the DNN acoustic models, the proposed framework provides efficient means for their speedy adaptation to different languages, even in the face of scanty adaptation data.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125596040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}