2018 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Sentiment Classification on Erroneous ASR Transcripts: A Multi View Learning Approach
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639665
Sri Harsha Dumpala, I. Sheikh, Rupayan Chakraborty, Sunil Kumar Kopparapu
{"title":"Sentiment Classification on Erroneous ASR Transcripts: A Multi View Learning Approach","authors":"Sri Harsha Dumpala, I. Sheikh, Rupayan Chakraborty, Sunil Kumar Kopparapu","doi":"10.1109/SLT.2018.8639665","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639665","url":null,"abstract":"Sentiment classification on spoken language transcriptions has received less attention. A practical system employing the spoken language modality will have to use a language transcription from an Automatic Speech Recognition (ASR) engine which is inherently prone to errors. The main interest of this paper lies in improvement of sentiment classification on erroneous ASR transcriptions. Our aim is to improve the representation of the ASR transcripts using the manual transcripts and other modalities, like audio and visual, that are available during training but not necessarily during test conditions. We adopt an approach based on Deep Canonical Correlation Analysis (DCCA) and propose two new extensions of DCCA to enhance the ASR view using multiple modalities. We present a detailed evaluation of the performance of our approach on datasets of opinion videos (CMU-MOSI and CMU-MOSEI) collected from Youtube.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128379102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
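As context for the DCCA objective named in the abstract, here is a minimal numpy sketch of the quantity DCCA maximizes: the sum of canonical correlations between two views. In the paper's setting the views would be learned representations of the ASR transcript and the manual transcript (or another modality); the array shapes and toy data below are purely illustrative.

```python
import numpy as np

def cca_correlation(H1, H2, eps=1e-8):
    """Sum of canonical correlations between two views (samples x features).

    This is the quantity DCCA maximizes; in DCCA, H1 and H2 would be the
    outputs of two neural networks rather than raw features.
    """
    n = H1.shape[0]
    H1 = H1 - H1.mean(axis=0)          # center each view
    H2 = H2 - H2.mean(axis=0)
    S11 = H1.T @ H1 / (n - 1) + eps * np.eye(H1.shape[1])  # regularized covariances
    S22 = H2.T @ H2 / (n - 1) + eps * np.eye(H2.shape[1])
    S12 = H1.T @ H2 / (n - 1)
    # Whiten each view; the singular values of T are the canonical correlations.
    K11 = np.linalg.inv(np.linalg.cholesky(S11))
    K22 = np.linalg.inv(np.linalg.cholesky(S22))
    T = K11 @ S12 @ K22.T
    return np.linalg.svd(T, compute_uv=False).sum()

# Toy usage: 100 utterances, 8-dim ASR-view and manual-view embeddings.
rng = np.random.default_rng(0)
asr_view = rng.standard_normal((100, 8))
manual_view = asr_view @ rng.standard_normal((8, 8)) + 0.1 * rng.standard_normal((100, 8))
print(cca_correlation(asr_view, manual_view))  # close to 8 for strongly correlated views
```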
[Title page]
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/slt.2018.8639512
{"title":"[Title page]","authors":"","doi":"10.1109/slt.2018.8639512","DOIUrl":"https://doi.org/10.1109/slt.2018.8639512","url":null,"abstract":"","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129886873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639699
Jesse Emond, B. Ramabhadran, Brian Roark, P. Moreno, Min Ma
{"title":"Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance","authors":"Jesse Emond, B. Ramabhadran, Brian Roark, P. Moreno, Min Ma","doi":"10.1109/SLT.2018.8639699","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639699","url":null,"abstract":"Code-switching is a commonly occurring phenomenon in many multilingual communities, wherein a speaker switches between languages within a single utterance. Conventional Word Error Rate (WER) is not sufficient for measuring the performance of code-mixed languages due to ambiguities in transcription, misspellings and borrowing of words from two different writing systems. These rendering errors artificially inflate the WER of an Automated Speech Recognition (ASR) system and complicate its evaluation. Furthermore, these errors make it harder to accurately evaluate modeling errors originating from code-switched language and acoustic models. In this work, we propose the use of a new metric, transliteration-optimized Word Error Rate (toWER) that smoothes out many of these irregularities by mapping all text to one writing system and demonstrate a correlation with the amount of code-switching present in a language. We also present a novel approach to acoustic and language modeling for bilingual code-switched Indic languages using the same transliteration approach to normalize the data for three types of language models, namely, a conventional n-gram language model, a maximum entropy based language model and a Long Short Term Memory (LSTM) language model, and a state-of-the-art Connectionist Temporal Classification (CTC) acoustic model. We demonstrate the robustness of the proposed approach on several Indic languages from Google Voice Search traffic with significant gains in ASR performance up to 10% relative over the state-of-the-art baseline.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"286 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130825756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
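The core of toWER, as described above, is scoring hypothesis and reference in a single writing system. The sketch below assumes a `transliterate` mapping (here a toy dict; the paper uses trained transliteration models) on top of a standard Levenshtein WER.

```python
def wer(ref_words, hyp_words):
    """Standard word error rate via Levenshtein distance over words."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref_words), 1)

def to_wer(ref, hyp, transliterate):
    """toWER-style scoring: normalize both strings into one writing system first.

    `transliterate` is a stand-in for a real transliteration model; here it is
    a simple word-level lookup, purely for illustration.
    """
    norm = lambda s: [transliterate.get(w, w) for w in s.split()]
    return wer(norm(ref), norm(hyp))

# Toy example: "namaste" rendered in Latin vs. Devanagari counts as a match.
mapping = {"namaste": "नमस्ते"}
print(to_wer("नमस्ते world", "namaste world", mapping))  # 0.0 after normalization
```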
A New Timit Benchmark for Context-Independent Phone Recognition Using Turbo Fusion
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639670
Timo Lohrenz, Wei Li, T. Fingscheidt
{"title":"A New Timit Benchmark for Context-Independent Phone Recognition Using Turbo Fusion","authors":"Timo Lohrenz, Wei Li, T. Fingscheidt","doi":"10.1109/SLT.2018.8639670","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639670","url":null,"abstract":"In this work, we apply the recently proposed turbo fusion in conjunction with state-of-the-art convolutional neural networks as acoustic models to the standard phone recognition task on the TIMIT database. The turbo fusion operates on posterior streams stemming from standard filterbank features and from group delay (phase) features. By the iterative exchange of posterior information, the phone error rate is decreased down to 16.91% absolute, which is to our knowledge the best reported result on the TIMIT core test set so far using context-independent acoustic models, outperforming the previous respective benchmark by 4.4% relative.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130740861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
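The abstract describes an iterative exchange of posterior information between a filterbank stream and a phase stream. The following is a loose numpy schematic of that idea, not the paper's exact turbo decoder; the exchange weight and iteration count are arbitrary choices.

```python
import numpy as np

def turbo_fuse(p_fbank, p_phase, iterations=3, weight=0.5):
    """Schematic iterative posterior exchange between two feature streams.

    p_fbank, p_phase: (frames x classes) posteriors from two acoustic models.
    Each pass feeds one stream's tempered posteriors into the other as
    extrinsic information, loosely mirroring turbo decoding.
    """
    a, b = p_fbank.copy(), p_phase.copy()
    for _ in range(iterations):
        a_new = a * b ** weight      # refine stream A with extrinsic info from B
        b_new = b * a ** weight      # and vice versa, using A's pre-update state
        a = a_new / a_new.sum(axis=1, keepdims=True)  # renormalize to posteriors
        b = b_new / b_new.sum(axis=1, keepdims=True)
    fused = a * b
    return fused / fused.sum(axis=1, keepdims=True)

# Toy usage: 4 frames, 3 phone classes.
rng = np.random.default_rng(1)
p1 = rng.dirichlet(np.ones(3), size=4)
p2 = rng.dirichlet(np.ones(3), size=4)
print(turbo_fuse(p1, p2).round(3))
```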
Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639589
M. Mimura, Sei Ueno, H. Inaguma, S. Sakai, Tatsuya Kawahara
{"title":"Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition","authors":"M. Mimura, Sei Ueno, H. Inaguma, S. Sakai, Tatsuya Kawahara","doi":"10.1109/SLT.2018.8639589","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639589","url":null,"abstract":"Encoder-decoder models for acoustic-to-word (A2W) automatic speech recognition (ASR) are attractive for their simplicity of architecture and run-time latency while achieving state-of-the-art performances. However, word-based models commonly suffer from the-of-vocabulary (OOV) word problem. They also cannot leverage text data to improve their language modeling capability. Recently, sequence-to-sequence neural speech synthesis models trainable from corpora have been developed and shown to achieve naturalness com- parable to recorded human speech. In this paper, we explore how we can leverage the current speech synthesis technology to tailor the ASR system for a target domain by preparing only a relevant text corpus. From a set of target domain texts, we generate speech features using a sequence-to-sequence speech synthesizer. These artificial speech features together with real speech features from conventional speech corpora are used to train an attention-based A2W model. Experimental results show that the proposed approach improves the word accuracy significantly compared to the baseline trained only with the real speech, although synthetic part of the training data comes only from a single female speaker voice.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114347005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 41
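The data-pooling step described above, combining real corpus features with features synthesized from target-domain text, might look as follows. The `synthesize_features` stub stands in for a real sequence-to-sequence synthesizer, and all names and shapes are illustrative.

```python
import numpy as np
import torch
from torch.utils.data import ConcatDataset, Dataset

class FeatureDataset(Dataset):
    """(speech features, word sequence) pairs for an attention-based A2W model."""
    def __init__(self, feats, targets):
        self.feats, self.targets = feats, targets
    def __len__(self):
        return len(self.feats)
    def __getitem__(self, i):
        return torch.as_tensor(self.feats[i]), self.targets[i]

def synthesize_features(text, n_frames=50, dim=80):
    """Stand-in for a sequence-to-sequence synthesizer (e.g., Tacotron-style).

    A real system would predict mel features from `text`; random frames here
    only keep the sketch self-contained.
    """
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal((n_frames, dim)).astype(np.float32)

# Real speech features from a conventional corpus (placeholder arrays) ...
real = FeatureDataset([np.zeros((40, 80), np.float32)], [["hello", "world"]])
# ... pooled with features synthesized from target-domain text only.
domain_texts = [["open", "the", "pod", "bay", "doors"]]
synth = FeatureDataset(
    [synthesize_features(" ".join(t)) for t in domain_texts], domain_texts)
train_set = ConcatDataset([real, synth])  # the A2W model trains on the union
print(len(train_set))                     # 2
```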
Investigation of Users’ Short Responses in Actual Conversation System and Automatic Recognition of their Intentions
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639523
Katsuya Yokoyama, Hiroaki Takatsu, Hiroshi Honda, S. Fujie, Tetsunori Kobayashi
{"title":"Investigation of Users’ Short Responses in Actual Conversation System and Automatic Recognition of their Intentions","authors":"Katsuya Yokoyama, Hiroaki Takatsu, Hiroshi Honda, S. Fujie, Tetsunori Kobayashi","doi":"10.1109/SLT.2018.8639523","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639523","url":null,"abstract":"In human-human conversations, listeners often convey intentions to speakers through feedback consisting of reflexive short responses. The speakers recognize these intentions and change the conversational plans to make communication more efficient. These functions are expected to be effective in human-system conversations also; however, there is only a few systems using these functions or a research corpus including such functions. We created a corpus that consists of users’ short responses to an actual conversation system and developed a model for recognizing the intention of these responses. First, we categorized the intention of feedback that affects the progress of conversations. We then collected 15604 short responses of users from 2060 conversation sessions using our news-delivery conversation system. Twelve annotators labeled each utterance based on intention through a listening test. We then designed our deep-neural-network-based intention recognition model using the collected data. We found that feedback in the form of questions, which is the most frequently occurring expression, was correctly recognized and contributed to the efficiency of the conversation system.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114856153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
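For a concrete picture of the recognition component, here is a minimal PyTorch sketch of a DNN intention classifier over fixed-size short-response features. The label set, feature dimension, and network sizes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Hypothetical label set; the paper's categories include question-type
# feedback, which it found most frequent and well recognized.
INTENTS = ["acknowledge", "question", "request_repeat", "negative"]

class IntentionClassifier(nn.Module):
    """Small feed-forward net mapping a fixed-size utterance embedding
    (e.g., pooled acoustic + lexical features) to intention posteriors."""
    def __init__(self, feat_dim=128, hidden=64, n_classes=len(INTENTS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )
    def forward(self, x):
        return self.net(x)

model = IntentionClassifier()
x = torch.randn(8, 128)             # batch of 8 short-response embeddings
pred = model(x).argmax(dim=1)
print([INTENTS[i] for i in pred.tolist()])
```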
Optimizing Neural Response Generator with Emotional Impact Information
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639613
Nurul Lubis, S. Sakti, Koichiro Yoshino, Satoshi Nakamura
{"title":"Optimizing Neural Response Generator with Emotional Impact Information","authors":"Nurul Lubis, S. Sakti, Koichiro Yoshino, Satoshi Nakamura","doi":"10.1109/SLT.2018.8639613","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639613","url":null,"abstract":"The potential of dialogue systems to address user’s emotional need has steadily grown. In particular, we focus on dialogue systems application to promote positive emotional states, similar to that of emotional support between humans. Positive emotion elicitation takes form as chat-based dialogue interactions that is layered with an implicit goal to improve user’s emotional state. To this date, existing approaches have only relied on mimicking the target responses without considering their emotional impact, i.e. the change of emotional state they cause on the listener, in the model itself. In this paper, we propose explicitly utilizing emotional impact information to optimize neural dialogue system towards generating responses that elicit positive emotion. We examine two emotion-rich corpora with different data collection scenarios: Wizard-of-Oz and spontaneous. Evaluation shows that the proposed method yields lower perplexity, as well as produces responses that are perceived as more natural and likely to elicit a more positive emotion.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"232 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121137591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
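One simple way to fold emotional-impact annotations into generator training, in the spirit of the abstract above, is to reweight the per-response likelihood loss by impact. This is a hedged sketch under that assumption, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def emotion_weighted_nll(logits, targets, impact, alpha=0.5, pad_id=0):
    """Token NLL reweighted by per-response emotional impact.

    logits: (batch, seq, vocab); targets: (batch, seq);
    impact: (batch,) score in [0, 1] for how positively each target response
    shifted the listener's emotional state (annotated in the corpus).
    Higher-impact responses get a larger training weight; `alpha` controls
    the strength of that bias and is an illustrative choice.
    """
    nll = F.cross_entropy(logits.transpose(1, 2), targets,
                          ignore_index=pad_id, reduction="none")  # (batch, seq)
    per_resp = nll.sum(dim=1) / (targets != pad_id).sum(dim=1).clamp(min=1)
    weights = 1.0 + alpha * impact      # impact-boosted sample weights
    return (weights * per_resp).mean()

# Toy usage: batch of 2 responses, vocab of 10, one high-impact sample.
logits = torch.randn(2, 5, 10)
targets = torch.randint(1, 10, (2, 5))
impact = torch.tensor([0.9, 0.1])
print(emotion_weighted_nll(logits, targets, impact))
```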
Efficient Building Strategy with Knowledge Distillation for Small-Footprint Acoustic Models
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639545
Takafumi Moriya, Hiroki Kanagawa, Kiyoaki Matsui, Takaaki Fukutomi, Yusuke Shinohara, Y. Yamaguchi, M. Okamoto, Y. Aono
{"title":"Efficient Building Strategy with Knowledge Distillation for Small-Footprint Acoustic Models","authors":"Takafumi Moriya, Hiroki Kanagawa, Kiyoaki Matsui, Takaaki Fukutomi, Yusuke Shinohara, Y. Yamaguchi, M. Okamoto, Y. Aono","doi":"10.1109/SLT.2018.8639545","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639545","url":null,"abstract":"In this paper, we propose a novel training strategy for deep neural network (DNN) based small-footprint acoustic models. The accuracy of DNN-based automatic speech recognition (ASR) systems can be greatly improved by leveraging large amounts of data to improve the level of expression. DNNs use many parameters to enhance recognition performance. Unfortunately, resource-constrained local devices are unable to run complex DNN-based ASR systems. For building compact acoustic models, the knowledge distillation (KD) approach is often used. KD uses a large, well-trained model that outputs target labels to train a compact model. However, the standard KD cannot fully utilize the large model outputs to train compact models because the soft logits provide only rough information. We assume that the large model must give more useful hints to the compact model. We propose an advanced KD that uses mean squared error to minimize the discrepancies between the final hidden layer outputs. We evaluate our proposal on recorded speech data sets assuming car-and home-use scenarios, and show that our models achieve lower character error rates than the conventional KD approach or from-scratch training on computation resource-constrained devices.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124081347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
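The proposed loss combines the usual soft-label distillation term with an MSE hint between final-hidden-layer outputs. A PyTorch sketch follows; the linear projection bridging student and teacher widths, and all weights and sizes, are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits,
            student_hidden, teacher_hidden, proj,
            temperature=2.0, lam=0.5):
    """Knowledge distillation with an extra hidden-representation hint.

    Standard KD matches tempered teacher posteriors (the KL term); the paper's
    extension additionally minimizes the MSE between final-hidden-layer
    outputs. `proj` is a linear map for when the compact model is narrower
    than the teacher (an assumption of this sketch).
    """
    t = temperature
    soft = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                    F.softmax(teacher_logits / t, dim=-1),
                    reduction="batchmean") * t * t
    hint = F.mse_loss(proj(student_hidden), teacher_hidden)
    return soft + lam * hint

# Toy usage: teacher width 512, student width 128, 40 output classes.
proj = torch.nn.Linear(128, 512)
s_logits, t_logits = torch.randn(16, 40), torch.randn(16, 40)
s_hid, t_hid = torch.randn(16, 128), torch.randn(16, 512)
print(kd_loss(s_logits, t_logits, s_hid, t_hid, proj))
```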
Rapid Speaker Adaptation of Neural Network Based Filterbank Layer for Automatic Speech Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639648
Hiroshi Seki, Kazumasa Yamamoto, T. Akiba, S. Nakagawa
{"title":"Rapid Speaker Adaptation of Neural Network Based Filterbank Layer for Automatic Speech Recognition","authors":"Hiroshi Seki, Kazumasa Yamamoto, T. Akiba, S. Nakagawa","doi":"10.1109/SLT.2018.8639648","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639648","url":null,"abstract":"Deep neural networks (DNN) have achieved significant success in the field of automatic speech recognition. Previously, we proposed a filterbank-incorporated DNN which takes power spectra as input features. This method has a function of VTLN (Vocal tract length normalization) and fMLLR (feature-space maximum likelihood linear regression). The filterbank layer can be implemented by using a small number of parameters and is optimized under a framework of backpropagation. Therefore, it is advantageous in adaptation under limited available data. In this paper, speaker adaptation is applied to the filterbank-incorporated DNN. By applying speaker adaptation using 15 utterances, the adapted model gave a 7.4% relative improvement in WER over the baseline DNN at a significance level of 0.005 on CSJ task. Adaptation of filterbank layer also showed better performance than the other adaptation methods; singular value decomposition (SVD) based adaptation and learning hidden unit contributions (LHUC).","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123779342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
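The filterbank-incorporated DNN can be pictured as a learnable linear layer on power spectra in front of a conventional backend, with adaptation updating only that layer. A sketch with illustrative sizes and random initialization (a real system would initialize from mel filterbank weights):

```python
import torch
import torch.nn as nn

class FilterbankDNN(nn.Module):
    """Acoustic model whose first layer is a learnable filterbank applied to
    power spectra; only this layer is updated during speaker adaptation."""
    def __init__(self, n_fft_bins=257, n_filters=40, n_senones=3000):
        super().__init__()
        self.fbank = nn.Linear(n_fft_bins, n_filters, bias=False)  # filterbank weights
        self.backend = nn.Sequential(
            nn.Linear(n_filters, 512), nn.ReLU(),
            nn.Linear(512, n_senones),
        )
    def forward(self, power_spec):
        # Log filterbank energies, clamped to keep the log well defined.
        feats = torch.log(torch.relu(self.fbank(power_spec)) + 1e-6)
        return self.backend(feats)

model = FilterbankDNN()
# Speaker adaptation: freeze everything except the filterbank layer, so the
# small set of filterbank parameters can be re-estimated from ~15 utterances.
for p in model.backend.parameters():
    p.requires_grad = False
optim = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.01)
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # 257*40 = 10280
```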
Exploring Layer Trajectory LSTM with Depth Processing Units and Attention
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639637
Jinyu Li, Liang Lu, Changliang Liu, Y. Gong
{"title":"Exploring Layer Trajectory LSTM with Depth Processing Units and Attention","authors":"Jinyu Li, Liang Lu, Changliang Liu, Y. Gong","doi":"10.1109/SLT.2018.8639637","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639637","url":null,"abstract":"Traditional LSTM model and its variants normally work in a frame-by-frame and layer-by-layer fashion, which deals with the temporal modeling and target classification problems at the same time. In this paper, we extend our recently proposed layer trajectory LSTM (ltLSTM) and present a generalized framework, which is equipped with a depth processing block that scans the hidden states of each time-LSTM layer, and uses the summarized layer trajectory information for final senone classification. We explore different modeling units used in the depth processing block to have a good tradeoff between accuracy and runtime cost. Furthermore, we integrate an attention module into this framework to explore wide context information, which is especially beneficial for uni-directional LSTMs. Trained with 30 thousand hours of EN-US Microsoft internal data and cross entropy criterion, the proposed generalized ltLSTM performed significantly better than the standard multi-layer time-LSTM, with up to 12.8% relative word error rate (WER) reduction across different tasks. With attention modeling, the relative WER reduction can be up to 17.9%. We observed similar gain when the models were trained with sequence discriminative training criterion.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123140661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
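A simplified rendering of the ltLSTM idea described above: time-LSTMs run frame by frame, while a depth-LSTM scans the stack of per-layer hidden states at each frame for senone classification. Dimensions are illustrative, and the attention module is omitted.

```python
import torch
import torch.nn as nn

class LayerTrajectoryLSTM(nn.Module):
    """Simplified ltLSTM: time-LSTMs handle temporal modeling, while a
    depth-LSTM scans the per-layer hidden states at each frame and its final
    state feeds senone classification."""
    def __init__(self, feat_dim=80, hidden=256, n_layers=4, n_senones=3000):
        super().__init__()
        self.time_lstms = nn.ModuleList(
            [nn.LSTM(feat_dim if i == 0 else hidden, hidden, batch_first=True)
             for i in range(n_layers)])
        self.depth_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_senones)

    def forward(self, x):                     # x: (batch, frames, feat_dim)
        layer_states = []
        for lstm in self.time_lstms:
            x, _ = lstm(x)                    # temporal modeling per layer
            layer_states.append(x)
        # Stack to (batch*frames, n_layers, hidden): the depth-LSTM scans
        # across layers at every frame, decoupling classification from time.
        traj = torch.stack(layer_states, dim=2)
        b, t, l, h = traj.shape
        _, (h_n, _) = self.depth_lstm(traj.reshape(b * t, l, h))
        return self.out(h_n[-1].reshape(b, t, h))

model = LayerTrajectoryLSTM()
print(model(torch.randn(2, 10, 80)).shape)   # torch.Size([2, 10, 3000])
```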