2018 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Improved Conditional Generative Adversarial Net Classification For Spoken Language Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639522
Xiaoxiao Miao, I. Mcloughlin, Shengyu Yao, Yonghong Yan
Recent research on generative adversarial nets (GAN) for language identification (LID) has shown promising results. In this paper, we further exploit the latent abilities of GAN networks to firstly combine them with deep neural network (DNN)-based i-vector approaches and then to improve the LID model using conditional generative adversarial net (cGAN) classification. First, phoneme dependent deep bottleneck features (DBF) combined with output posteriors of a pre-trained DNN for automatic speech recognition (ASR) are used to extract i-vectors in the normal way. These i-vectors are then classified using cGAN, and we show an effective method within the cGAN to optimize parameters by combining both language identification and verification signals as supervision. Results show firstly that cGAN methods can significantly outperform DBF DNN i-vector methods where 49-dimensional i-vectors are used, but not where 600-dimensional vectors are used. Secondly, training a cGAN discriminator network for direct classification has further benefit for low dimensional i-vectors as well as short utterances with high dimensional i-vectors. However, incorporating a dedicated discriminator network output layer for classification and optimizing both classification and verification loss brings benefits in all test cases.
Citations: 7
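As an illustration of the combined supervision described in the abstract above, here is a minimal sketch (not the authors' exact model) of a cGAN-style discriminator over i-vectors with two output heads: a verification head for the real-versus-generated signal and an identification head over language classes, with the two losses summed. The layer sizes, the 600-dimensional input, the number of languages, and the loss weight `alpha` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LIDDiscriminator(nn.Module):
    """Discriminator with a verification (real/generated) head and a language-ID head."""
    def __init__(self, ivector_dim=600, num_languages=10, hidden=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(ivector_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
        )
        self.verification_head = nn.Linear(hidden, 1)                 # real vs. generated i-vector
        self.identification_head = nn.Linear(hidden, num_languages)   # language classes

    def forward(self, ivectors):
        h = self.body(ivectors)
        return self.verification_head(h), self.identification_head(h)

def combined_loss(verif_logit, ident_logits, is_real, lang_labels, alpha=1.0):
    """Sum of the verification (adversarial) and identification (classification) losses."""
    verification = F.binary_cross_entropy_with_logits(verif_logit.squeeze(-1), is_real.float())
    identification = F.cross_entropy(ident_logits, lang_labels)
    return verification + alpha * identification
```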
Improved Knowledge Distillation from Bi-Directional to Uni-Directional LSTM CTC for End-to-End Speech Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639629
Gakuto Kurata, Kartik Audhkhasi
End-to-end automatic speech recognition (ASR) promises to simplify model training and deployment. Most end-to-end ASR systems utilize a bi-directional Long Short-Term Memory (BiLSTM) acoustic model due to its ability to capture acoustic context from the entire utterance. However, BiLSTM models have high latency and cannot be used in streaming applications. Leveraging knowledge distillation to train a low-latency end-to-end uni-directional LSTM (UniLSTM) model from a BiLSTM model can be an option. However, it makes the strict assumption of shared frame-wise time alignments between the two models. We propose an improved knowledge distillation algorithm that relaxes this assumption and improves the accuracy of the UniLSTM model. We confirmed the advantage of the proposed method on a standard English conversational telephone speech recognition task.
Citations: 38
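For context on the alignment assumption mentioned in the abstract, the sketch below shows conventional frame-wise distillation, in which the UniLSTM student's per-frame posteriors are pulled toward the BiLSTM teacher's posteriors at the same frame; it is exactly this frame-by-frame coupling that the paper relaxes. The tensor shapes and temperature are assumptions.

```python
import torch.nn.functional as F

def framewise_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Frame-wise KL(teacher || student); both tensors are shaped (frames, num_labels)."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # kl_div takes the student distribution in log space and the teacher as probabilities.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```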
Corpus and Annotation Towards NLU for Customer Ordering Dialogs
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639605
John Chen, R. Prasad, Svetlana Stoyanchev, Ethan Selfridge, S. Bangalore, Michael Johnston
Ordering products and services through virtual agents is possible but suffers limitations on the kind of ordering that is possible or on the naturalness of the conversation. We address these limitations by collecting a corpus of human-human dialogs in the food ordering domain. We create a food focused annotation scheme that is tailored for this corpus but customizable for other applications. After annotating the corpus, we find corpus characteristics that may make it more natural, such as complexity of food item mentions and use of multiple intent utterances. Furthermore, we train and evaluate preliminary statistical item and intent models using the annotated corpus.
Citations: 5
Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639699
Jesse Emond, B. Ramabhadran, Brian Roark, P. Moreno, Min Ma
Code-switching is a commonly occurring phenomenon in many multilingual communities, wherein a speaker switches between languages within a single utterance. Conventional Word Error Rate (WER) is not sufficient for measuring the performance of code-mixed languages due to ambiguities in transcription, misspellings and borrowing of words from two different writing systems. These rendering errors artificially inflate the WER of an Automated Speech Recognition (ASR) system and complicate its evaluation. Furthermore, these errors make it harder to accurately evaluate modeling errors originating from code-switched language and acoustic models. In this work, we propose the use of a new metric, transliteration-optimized Word Error Rate (toWER) that smoothes out many of these irregularities by mapping all text to one writing system and demonstrate a correlation with the amount of code-switching present in a language. We also present a novel approach to acoustic and language modeling for bilingual code-switched Indic languages using the same transliteration approach to normalize the data for three types of language models, namely, a conventional n-gram language model, a maximum entropy based language model and a Long Short Term Memory (LSTM) language model, and a state-of-the-art Connectionist Temporal Classification (CTC) acoustic model. We demonstrate the robustness of the proposed approach on several Indic languages from Google Voice Search traffic with significant gains in ASR performance up to 10% relative over the state-of-the-art baseline.
Citations: 28
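A minimal sketch of the scoring idea behind toWER: map both the reference and the hypothesis into a single writing system before computing the usual edit-distance WER, so that transliteration and rendering differences stop counting as errors. The `transliterate` callable is a hypothetical placeholder; the actual metric relies on transliteration models for Indic scripts.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def to_wer(reference, hypothesis, transliterate):
    """WER after mapping both sides to one writing system with `transliterate`."""
    ref = [transliterate(w) for w in reference.split()]
    hyp = [transliterate(w) for w in hypothesis.split()]
    return edit_distance(ref, hyp) / max(len(ref), 1)
```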
Speech Chain for Semi-Supervised Learning of Japanese-English Code-Switching ASR and TTS
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639674
Sahoko Nakayama, Andros Tjandra, S. Sakti, Satoshi Nakamura
Code-switching (CS) speech, in which speakers alternate between two or more languages in the same utterance, often occurs in multilingual communities. Such a phenomenon poses challenges for spoken language technologies: automatic speech recognition (ASR) and text-to-speech synthesis (TTS), since the systems need to be able to handle the input in a multilingual setting. We may find code-switching text or code-switching speech in social media, but parallel speech and the transcriptions of code-switching data, which are suitable for training ASR and TTS, are generally unavailable. In this paper, we utilize a speech chain framework based on deep learning to enable ASR and TTS to learn code-switching in a semi-supervised fashion. We base our system on Japanese-English conversational speech. We first separately train the ASR and TTS systems with parallel speech-text of monolingual data (supervised learning) and perform a speech chain with only code-switching text or code-switching speech (unsupervised learning). Experimental results reveal that such closed-loop architecture allows ASR and TTS to learn from each other and improve the performance even without any parallel code-switching data.
Citations: 21
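A minimal sketch of one closed-loop step in the spirit of the speech chain described above: ASR transcribes unpaired speech and TTS is trained to reconstruct that speech from the hypothesis, while TTS synthesizes unpaired text and ASR is trained to recognize the synthetic audio. The `asr` and `tts` objects and their methods are hypothetical placeholders, not the authors' API.

```python
def speech_chain_step(asr, tts, unpaired_speech, unpaired_text):
    """Return the two unsupervised losses from one speech-chain pass."""
    # Speech -> text -> speech: the TTS reconstruction loss is the training signal for TTS.
    hypothesis = asr.transcribe(unpaired_speech)
    tts_loss = tts.reconstruction_loss(text=hypothesis, target_speech=unpaired_speech)

    # Text -> speech -> text: the ASR recognition loss is the training signal for ASR.
    synthetic_speech = tts.synthesize(unpaired_text)
    asr_loss = asr.recognition_loss(speech=synthetic_speech, target_text=unpaired_text)
    return asr_loss, tts_loss
```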
Deep View2View Mapping for View-Invariant Lipreading
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639698
Alexandros Koumparoulis, G. Potamianos
Recently, visual-only and audio-visual speech recognition have made significant progress thanks to deep-learning based, trainable visual front-ends (VFEs), with most research focusing on frontal or near-frontal face videos. In this paper, we seek to expand the applicability of VFEs targeted on frontal face views to non-frontal ones, without making assumptions on the VFE type, and allowing systems trained on frontal-view data to be applied on mismatched, non-frontal videos. For this purpose, we adapt the “pix2pix” model, recently proposed for image translation tasks, to transform non-frontal speaker mouth regions to frontal, employing a convolutional neural network architecture, which we call “view2view”. We develop our approach on the OuluVS2 multiview lipreading dataset, allowing training of four such networks that map views at predefined non-frontal angles (up to profile) to frontal ones, which we subsequently feed to a frontal-view VFE. We compare the “view2view” network against a baseline that performs linear cross-view regression at the VFE space. Results on visual-only, as well as audio-visual automatic speech recognition over multiple acoustic noise conditions, demonstrate that the “view2view” significantly outperforms the baseline, narrowing the performance gap from an ideal, matched scenario of view-specific systems. Improvements are retained when the approach is coupled with an automatic view estimator.
Citations: 12
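A minimal sketch of the inference pipeline implied by the abstract: a non-frontal mouth-region crop is first mapped to a synthetic frontal view by a view2view (pix2pix-style) generator and only then passed to a visual front-end trained on frontal data. Both modules and the tensor shape are hypothetical placeholders.

```python
import torch

def lipreading_features(nonfrontal_crop, view2view_generator, frontal_vfe):
    """nonfrontal_crop: (1, channels, height, width) tensor of a mouth-region image."""
    with torch.no_grad():
        frontal_like = view2view_generator(nonfrontal_crop)  # synthesize a frontal view
        return frontal_vfe(frontal_like)                     # frontal-view visual features
```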
First-Pass Techniques for Very Large Vocabulary Speech Recognition of Morphologically Rich Languages
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639691
Matti Varjokallio, Sami Virpioja, M. Kurimo
In speech recognition of morphologically rich languages, very large vocabulary sizes are required to achieve good error rates. Especially traditional n-gram language models trained over word sequences suffer from data sparsity issues. The language modelling can often be improved by segmenting the words to sequences of subword units that are more frequent. Another solution is to cluster the words into classes and apply a class-based language model. We show that linearly interpolating n-gram models trained over words, subwords, and word classes improves the first-pass speech recognition accuracy in very large vocabulary speech recognition tasks for two morphologically rich and agglutinative languages, Finnish and Estonian. To overcome performance issues, we also introduce a novel language model look-ahead method utilizing a class bigram model. The method improves the results over a unigram look-ahead model with the same recognition speed, the difference increasing for small real-time factors. The improved model combination and look-ahead model are useful in cases where real-time recognition is required or when the improved hypotheses help with further recognition passes. For instance, neural network language models are mostly applied by rescoring the generated hypotheses due to higher computational costs.
Citations: 5
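A minimal sketch of the linear interpolation applied to the word, subword, and class-based n-gram models; the component probabilities and weights below are placeholders, with the weights tuned on held-out data in practice.

```python
def interpolated_prob(component_probs, weights):
    """P(w | h) = sum_k lambda_k * P_k(w | h), with the weights summing to one."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[name] * prob for name, prob in component_probs.items())

# Assumed per-model probabilities for the same word in the same context.
p = interpolated_prob(
    {"word": 0.012, "subword": 0.020, "class": 0.015},
    {"word": 0.5, "subword": 0.3, "class": 0.2},
)
```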
Combining End-to-End and Adversarial Training for Low-Resource Speech Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639541
Jennifer Drexler, James R. Glass
In this paper, we develop an end-to-end automatic speech recognition (ASR) model designed for a common low-resource scenario: no pronunciation dictionary or phonemic transcripts, very limited transcribed speech, and much larger non-parallel text and speech corpora. Our semi-supervised model is built on top of an encoder-decoder model with attention and takes advantage of non-parallel speech and text corpora in several ways: a denoising text autoencoder that shares parameters with the ASR decoder, a speech autoencoder that shares parameters with the ASR encoder, and adversarial training that encourages the speech and text encoders to use the same embedding space. We show that a model with this architecture significantly outperforms the baseline in this low-resource condition. We additionally perform an ablation evaluation, demonstrating that all of our added components contribute substantially to the overall performance of our model. We propose several avenues for further work, noting in particular that a model with this architecture could potentially enable fully unsupervised speech recognition.
Citations: 23
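A minimal sketch of how the objectives listed in the abstract could be combined into one semi-supervised training loss: the supervised ASR term on paired data plus the text-autoencoder, speech-autoencoder, and adversarial terms computed on non-parallel data. The weighting is an illustrative assumption, not the authors' recipe.

```python
def semi_supervised_loss(asr_loss, text_ae_loss, speech_ae_loss, adversarial_loss,
                         w_text=1.0, w_speech=1.0, w_adv=0.1):
    """Weighted sum of the supervised ASR loss and the three unsupervised terms."""
    return asr_loss + w_text * text_ae_loss + w_speech * speech_ae_loss + w_adv * adversarial_loss
```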
Sentiment Classification on Erroneous ASR Transcripts: A Multi View Learning Approach
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639665
Sri Harsha Dumpala, I. Sheikh, Rupayan Chakraborty, Sunil Kumar Kopparapu
Sentiment classification on spoken language transcriptions has received less attention. A practical system employing the spoken language modality will have to use a language transcription from an Automatic Speech Recognition (ASR) engine which is inherently prone to errors. The main interest of this paper lies in improvement of sentiment classification on erroneous ASR transcriptions. Our aim is to improve the representation of the ASR transcripts using the manual transcripts and other modalities, like audio and visual, that are available during training but not necessarily during test conditions. We adopt an approach based on Deep Canonical Correlation Analysis (DCCA) and propose two new extensions of DCCA to enhance the ASR view using multiple modalities. We present a detailed evaluation of the performance of our approach on datasets of opinion videos (CMU-MOSI and CMU-MOSEI) collected from Youtube.
Citations: 7
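To illustrate the correlation-based coupling behind (deep) CCA, the sketch below scores how correlated two projected views are (for example, the ASR-transcript view against a manual-transcript, audio, or visual view) with a per-dimension Pearson surrogate. This is a simplification of the full DCCA objective, which solves a CCA problem on the projected views; shapes are illustrative assumptions.

```python
import torch

def negative_correlation_loss(view_a, view_b, eps=1e-8):
    """Negative mean per-dimension correlation between two (batch, dim) projections."""
    a = view_a - view_a.mean(dim=0, keepdim=True)
    b = view_b - view_b.mean(dim=0, keepdim=True)
    covariance = (a * b).mean(dim=0)
    denominator = torch.sqrt((a * a).mean(dim=0) * (b * b).mean(dim=0)) + eps
    return -(covariance / denominator).mean()
```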
Scalable Language Model Adaptation for Spoken Dialogue Systems
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639663
Ankur Gandhe, A. Rastrow, Björn Hoffmeister
Language models (LM) for interactive speech recognition systems are trained on large amounts of data and the model parameters are optimized on past user data. New application intents and interaction types are released for these systems over time, imposing challenges to adapt the LMs since the existing training data is no longer sufficient to model the future user interactions. It is unclear how to adapt LMs to new application intents without degrading the performance on existing applications. In this paper, we propose a solution to (a) estimate n-gram counts directly from the hand-written grammar for training LMs and (b) use constrained optimization to optimize the system parameters for future use cases, while not degrading the performance on past usage. We evaluated our approach on new application intents for a personal assistant system and find that the adaptation improves the word error rate by up to 15% on new applications even when there is no adaptation data available for an application.
Citations: 21
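A minimal sketch of the first idea in the abstract: derive n-gram counts for LM training directly from a hand-written grammar by expanding or sampling it into weighted sentences and accumulating fractional counts. Producing the weighted sentences is left as a hypothetical placeholder; a real system would expand the grammar with a finite-state toolkit.

```python
from collections import Counter

def ngram_counts_from_grammar(weighted_sentences, order=3):
    """weighted_sentences: iterable of (token_list, weight) pairs expanded from the grammar."""
    counts = Counter()
    for tokens, weight in weighted_sentences:
        padded = ["<s>"] * (order - 1) + tokens + ["</s>"]
        for i in range(len(padded) - order + 1):
            counts[tuple(padded[i:i + order])] += weight
    return counts
```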