2018 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Improved Conditional Generative Adversarial Net Classification For Spoken Language Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639522
Xiaoxiao Miao, I. Mcloughlin, Shengyu Yao, Yonghong Yan
Recent research on generative adversarial nets (GAN) for language identification (LID) has shown promising results. In this paper, we further exploit the latent abilities of GAN networks to firstly combine them with deep neural network (DNN)-based i-vector approaches and then to improve the LID model using conditional generative adversarial net (cGAN) classification. First, phoneme dependent deep bottleneck features (DBF) combined with output posteriors of a pre-trained DNN for automatic speech recognition (ASR) are used to extract i-vectors in the normal way. These i-vectors are then classified using cGAN, and we show an effective method within the cGAN to optimize parameters by combining both language identification and verification signals as supervision. Results show firstly that cGAN methods can significantly outperform DBF DNN i-vector methods where 49-dimensional i-vectors are used, but not where 600-dimensional vectors are used. Secondly, training a cGAN discriminator network for direct classification has further benefit for low dimensional i-vectors as well as short utterances with high dimensional i-vectors. However, incorporating a dedicated discriminator network output layer for classification and optimizing both classification and verification loss brings benefits in all test cases.
Citations: 7
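As an illustration of the combined supervision described in the abstract above, here is a minimal sketch (not the authors' exact model) of a cGAN-style discriminator over i-vectors with two output heads: a verification head for the real-versus-generated signal and an identification head over language classes, with the two losses summed. The layer sizes, the 600-dimensional input, the number of languages, and the loss weight `alpha` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LIDDiscriminator(nn.Module):
    """Discriminator with a verification (real/generated) head and a language-ID head."""
    def __init__(self, ivector_dim=600, num_languages=10, hidden=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(ivector_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
        )
        self.verification_head = nn.Linear(hidden, 1)                 # real vs. generated i-vector
        self.identification_head = nn.Linear(hidden, num_languages)   # language classes

    def forward(self, ivectors):
        h = self.body(ivectors)
        return self.verification_head(h), self.identification_head(h)

def combined_loss(verif_logit, ident_logits, is_real, lang_labels, alpha=1.0):
    """Sum of the verification (adversarial) and identification (classification) losses."""
    verification = F.binary_cross_entropy_with_logits(verif_logit.squeeze(-1), is_real.float())
    identification = F.cross_entropy(ident_logits, lang_labels)
    return verification + alpha * identification
```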
Improved Knowledge Distillation from Bi-Directional to Uni-Directional LSTM CTC for End-to-End Speech Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639629
Gakuto Kurata, Kartik Audhkhasi
End-to-end automatic speech recognition (ASR) promises to simplify model training and deployment. Most end-to-end ASR systems utilize a bi-directional Long Short-Term Memory (BiLSTM) acoustic model due to its ability to capture acoustic context from the entire utterance. However, BiLSTM models have high latency and cannot be used in streaming applications. Leveraging knowledge distillation to train a low-latency end-to-end uni-directional LSTM (UniLSTM) model from a BiLSTM model can be an option. However, it makes the strict assumption of shared frame-wise time alignments between the two models. We propose an improved knowledge distillation algorithm that relaxes this assumption and improves the accuracy of the UniLSTM model. We confirmed the advantage of the proposed method on a standard English conversational telephone speech recognition task.
Citations: 38
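For context on the alignment assumption mentioned in the abstract, the sketch below shows conventional frame-wise distillation, in which the UniLSTM student's per-frame posteriors are pulled toward the BiLSTM teacher's posteriors at the same frame; it is exactly this frame-by-frame coupling that the paper relaxes. The tensor shapes and temperature are assumptions.

```python
import torch.nn.functional as F

def framewise_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Frame-wise KL(teacher || student); both tensors are shaped (frames, num_labels)."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # kl_div takes the student distribution in log space and the teacher as probabilities.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```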
Corpus and Annotation Towards NLU for Customer Ordering Dialogs
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639605
John Chen, R. Prasad, Svetlana Stoyanchev, Ethan Selfridge, S. Bangalore, Michael Johnston
Ordering products and services through virtual agents is possible but suffers limitations on the kind of ordering that is possible or on the naturalness of the conversation. We address these limitations by collecting a corpus of human-human dialogs in the food ordering domain. We create a food focused annotation scheme that is tailored for this corpus but customizable for other applications. After annotating the corpus, we find corpus characteristics that may make it more natural, such as complexity of food item mentions and use of multiple intent utterances. Furthermore, we train and evaluate preliminary statistical item and intent models using the annotated corpus.
Citations: 5
Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639699
Jesse Emond, B. Ramabhadran, Brian Roark, P. Moreno, Min Ma
Code-switching is a commonly occurring phenomenon in many multilingual communities, wherein a speaker switches between languages within a single utterance. Conventional Word Error Rate (WER) is not sufficient for measuring the performance of code-mixed languages due to ambiguities in transcription, misspellings and borrowing of words from two different writing systems. These rendering errors artificially inflate the WER of an Automated Speech Recognition (ASR) system and complicate its evaluation. Furthermore, these errors make it harder to accurately evaluate modeling errors originating from code-switched language and acoustic models. In this work, we propose the use of a new metric, transliteration-optimized Word Error Rate (toWER) that smoothes out many of these irregularities by mapping all text to one writing system and demonstrate a correlation with the amount of code-switching present in a language. We also present a novel approach to acoustic and language modeling for bilingual code-switched Indic languages using the same transliteration approach to normalize the data for three types of language models, namely, a conventional n-gram language model, a maximum entropy based language model and a Long Short Term Memory (LSTM) language model, and a state-of-the-art Connectionist Temporal Classification (CTC) acoustic model. We demonstrate the robustness of the proposed approach on several Indic languages from Google Voice Search traffic with significant gains in ASR performance up to 10% relative over the state-of-the-art baseline.
Citations: 28
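A minimal sketch of the scoring idea behind toWER: map both the reference and the hypothesis into a single writing system before computing the usual edit-distance WER, so that transliteration and rendering differences stop counting as errors. The `transliterate` callable is a hypothetical placeholder; the actual metric relies on transliteration models for Indic scripts.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def to_wer(reference, hypothesis, transliterate):
    """WER after mapping both sides to one writing system with `transliterate`."""
    ref = [transliterate(w) for w in reference.split()]
    hyp = [transliterate(w) for w in hypothesis.split()]
    return edit_distance(ref, hyp) / max(len(ref), 1)
```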
Speech Chain for Semi-Supervised Learning of Japanese-English Code-Switching ASR and TTS
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639674
Sahoko Nakayama, Andros Tjandra, S. Sakti, Satoshi Nakamura
Code-switching (CS) speech, in which speakers alternate between two or more languages in the same utterance, often occurs in multilingual communities. Such a phenomenon poses challenges for spoken language technologies: automatic speech recognition (ASR) and text-to-speech synthesis (TTS), since the systems need to be able to handle the input in a multilingual setting. We may find code-switching text or code-switching speech in social media, but parallel speech and the transcriptions of code-switching data, which are suitable for training ASR and TTS, are generally unavailable. In this paper, we utilize a speech chain framework based on deep learning to enable ASR and TTS to learn code-switching in a semi-supervised fashion. We base our system on Japanese-English conversational speech. We first separately train the ASR and TTS systems with parallel speech-text of monolingual data (supervised learning) and perform a speech chain with only code-switching text or code-switching speech (unsupervised learning). Experimental results reveal that such closed-loop architecture allows ASR and TTS to learn from each other and improve the performance even without any parallel code-switching data.
Citations: 21
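A minimal sketch of one closed-loop step in the spirit of the speech chain described above: ASR transcribes unpaired speech and TTS is trained to reconstruct that speech from the hypothesis, while TTS synthesizes unpaired text and ASR is trained to recognize the synthetic audio. The `asr` and `tts` objects and their methods are hypothetical placeholders, not the authors' API.

```python
def speech_chain_step(asr, tts, unpaired_speech, unpaired_text):
    """Return the two unsupervised losses from one speech-chain pass."""
    # Speech -> text -> speech: the TTS reconstruction loss is the training signal for TTS.
    hypothesis = asr.transcribe(unpaired_speech)
    tts_loss = tts.reconstruction_loss(text=hypothesis, target_speech=unpaired_speech)

    # Text -> speech -> text: the ASR recognition loss is the training signal for ASR.
    synthetic_speech = tts.synthesize(unpaired_text)
    asr_loss = asr.recognition_loss(speech=synthetic_speech, target_text=unpaired_text)
    return asr_loss, tts_loss
```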
Deep View2View Mapping for View-Invariant Lipreading
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639698
Alexandros Koumparoulis, G. Potamianos
Recently, visual-only and audio-visual speech recognition have made significant progress thanks to deep-learning based, trainable visual front-ends (VFEs), with most research focusing on frontal or near-frontal face videos. In this paper, we seek to expand the applicability of VFEs targeted on frontal face views to non-frontal ones, without making assumptions on the VFE type, and allowing systems trained on frontal-view data to be applied on mismatched, non-frontal videos. For this purpose, we adapt the “pix2pix” model, recently proposed for image translation tasks, to transform non-frontal speaker mouth regions to frontal, employing a convolutional neural network architecture, which we call “view2view”. We develop our approach on the OuluVS2 multiview lipreading dataset, allowing training of four such networks that map views at predefined non-frontal angles (up to profile) to frontal ones, which we subsequently feed to a frontal-view VFE. We compare the “view2view” network against a baseline that performs linear cross-view regression at the VFE space. Results on visual-only, as well as audio-visual automatic speech recognition over multiple acoustic noise conditions, demonstrate that the “view2view” significantly outperforms the baseline, narrowing the performance gap from an ideal, matched scenario of view-specific systems. Improvements are retained when the approach is coupled with an automatic view estimator.
Citations: 12
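A minimal sketch of the inference pipeline implied by the abstract: a non-frontal mouth-region crop is first mapped to a synthetic frontal view by a view2view (pix2pix-style) generator and only then passed to a visual front-end trained on frontal data. Both modules and the tensor shape are hypothetical placeholders.

```python
import torch

def lipreading_features(nonfrontal_crop, view2view_generator, frontal_vfe):
    """nonfrontal_crop: (1, channels, height, width) tensor of a mouth-region image."""
    with torch.no_grad():
        frontal_like = view2view_generator(nonfrontal_crop)  # synthesize a frontal view
        return frontal_vfe(frontal_like)                     # frontal-view visual features
```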
First-Pass Techniques for Very Large Vocabulary Speech Recognition of Morphologically Rich Languages
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639691
Matti Varjokallio, Sami Virpioja, M. Kurimo
In speech recognition of morphologically rich languages, very large vocabulary sizes are required to achieve good error rates. Especially traditional n-gram language models trained over word sequences suffer from data sparsity issues. The language modelling can often be improved by segmenting the words to sequences of subword units that are more frequent. Another solution is to cluster the words into classes and apply a class-based language model. We show that linearly interpolating n-gram models trained over words, subwords, and word classes improves the first-pass speech recognition accuracy in very large vocabulary speech recognition tasks for two morphologically rich and agglutinative languages, Finnish and Estonian. To overcome performance issues, we also introduce a novel language model look-ahead method utilizing a class bigram model. The method improves the results over a unigram look-ahead model with the same recognition speed, the difference increasing for small real-time factors. The improved model combination and look-ahead model are useful in cases where real-time recognition is required or when the improved hypotheses help with further recognition passes. For instance, neural network language models are mostly applied by rescoring the generated hypotheses due to higher computational costs.
Citations: 5
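A minimal sketch of the linear interpolation applied to the word, subword, and class-based n-gram models; the component probabilities and weights below are placeholders, with the weights tuned on held-out data in practice.

```python
def interpolated_prob(component_probs, weights):
    """P(w | h) = sum_k lambda_k * P_k(w | h), with the weights summing to one."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[name] * prob for name, prob in component_probs.items())

# Assumed per-model probabilities for the same word in the same context.
p = interpolated_prob(
    {"word": 0.012, "subword": 0.020, "class": 0.015},
    {"word": 0.5, "subword": 0.3, "class": 0.2},
)
```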
Combining End-to-End and Adversarial Training for Low-Resource Speech Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639541
Jennifer Drexler, James R. Glass
In this paper, we develop an end-to-end automatic speech recognition (ASR) model designed for a common low-resource scenario: no pronunciation dictionary or phonemic transcripts, very limited transcribed speech, and much larger non-parallel text and speech corpora. Our semi-supervised model is built on top of an encoder-decoder model with attention and takes advantage of non-parallel speech and text corpora in several ways: a denoising text autoencoder that shares parameters with the ASR decoder, a speech autoencoder that shares parameters with the ASR encoder, and adversarial training that encourages the speech and text encoders to use the same embedding space. We show that a model with this architecture significantly outperforms the baseline in this low-resource condition. We additionally perform an ablation evaluation, demonstrating that all of our added components contribute substantially to the overall performance of our model. We propose several avenues for further work, noting in particular that a model with this architecture could potentially enable fully unsupervised speech recognition.
Citations: 23
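A minimal sketch of how the objectives listed in the abstract could be combined into one semi-supervised training loss: the supervised ASR term on paired data plus the text-autoencoder, speech-autoencoder, and adversarial terms computed on non-parallel data. The weighting is an illustrative assumption, not the authors' recipe.

```python
def semi_supervised_loss(asr_loss, text_ae_loss, speech_ae_loss, adversarial_loss,
                         w_text=1.0, w_speech=1.0, w_adv=0.1):
    """Weighted sum of the supervised ASR loss and the three unsupervised terms."""
    return asr_loss + w_text * text_ae_loss + w_speech * speech_ae_loss + w_adv * adversarial_loss
```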
Sentiment Classification on Erroneous ASR Transcripts: A Multi View Learning Approach
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639665
Sri Harsha Dumpala, I. Sheikh, Rupayan Chakraborty, Sunil Kumar Kopparapu
Sentiment classification on spoken language transcriptions has received less attention. A practical system employing the spoken language modality will have to use a language transcription from an Automatic Speech Recognition (ASR) engine which is inherently prone to errors. The main interest of this paper lies in improvement of sentiment classification on erroneous ASR transcriptions. Our aim is to improve the representation of the ASR transcripts using the manual transcripts and other modalities, like audio and visual, that are available during training but not necessarily during test conditions. We adopt an approach based on Deep Canonical Correlation Analysis (DCCA) and propose two new extensions of DCCA to enhance the ASR view using multiple modalities. We present a detailed evaluation of the performance of our approach on datasets of opinion videos (CMU-MOSI and CMU-MOSEI) collected from Youtube.
Citations: 7
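To illustrate the correlation-based coupling behind (deep) CCA, the sketch below scores how correlated two projected views are (for example, the ASR-transcript view against a manual-transcript, audio, or visual view) with a per-dimension Pearson surrogate. This is a simplification of the full DCCA objective, which solves a CCA problem on the projected views; shapes are illustrative assumptions.

```python
import torch

def negative_correlation_loss(view_a, view_b, eps=1e-8):
    """Negative mean per-dimension correlation between two (batch, dim) projections."""
    a = view_a - view_a.mean(dim=0, keepdim=True)
    b = view_b - view_b.mean(dim=0, keepdim=True)
    covariance = (a * b).mean(dim=0)
    denominator = torch.sqrt((a * a).mean(dim=0) * (b * b).mean(dim=0)) + eps
    return -(covariance / denominator).mean()
```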
Scalable Language Model Adaptation for Spoken Dialogue Systems
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639663
Ankur Gandhe, A. Rastrow, Björn Hoffmeister
Language models (LM) for interactive speech recognition systems are trained on large amounts of data and the model parameters are optimized on past user data. New application intents and interaction types are released for these systems over time, imposing challenges to adapt the LMs since the existing training data is no longer sufficient to model the future user interactions. It is unclear how to adapt LMs to new application intents without degrading the performance on existing applications. In this paper, we propose a solution to (a) estimate n-gram counts directly from the hand-written grammar for training LMs and (b) use constrained optimization to optimize the system parameters for future use cases, while not degrading the performance on past usage. We evaluated our approach on new application intents for a personal assistant system and find that the adaptation improves the word error rate by up to 15% on new applications even when there is no adaptation data available for an application.
Citations: 21
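A minimal sketch of the first idea in the abstract: derive n-gram counts for LM training directly from a hand-written grammar by expanding or sampling it into weighted sentences and accumulating fractional counts. Producing the weighted sentences is left as a hypothetical placeholder; a real system would expand the grammar with a finite-state toolkit.

```python
from collections import Counter

def ngram_counts_from_grammar(weighted_sentences, order=3):
    """weighted_sentences: iterable of (token_list, weight) pairs expanded from the grammar."""
    counts = Counter()
    for tokens, weight in weighted_sentences:
        padded = ["<s>"] * (order - 1) + tokens + ["</s>"]
        for i in range(len(padded) - order + 1):
            counts[tuple(padded[i:i + order])] += weight
    return counts
```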