{"title":"SLU for Voice Command in Smart Home: Comparison of Pipeline and End-to-End Approaches","authors":"Thierry Desot, François Portet, Michel Vacher","doi":"10.1109/ASRU46091.2019.9003891","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003891","url":null,"abstract":"Spoken Language Understanding (SLU) is typically performed through automatic speech recognition (ASR) and natural language understanding (NLU) in a pipeline. However, errors at the ASR stage have a negative impact on the NLU performance. Hence, there is a rising interest in End-to-End (E2E) SLU to jointly perform ASR and NLU. Although E2E models have shown superior performance to modular approaches in many NLP tasks, current SLU E2E models have still not definitely superseded pipeline approaches. In this paper, we present a comparison of the pipeline and E2E approaches for the task of voice command in smart homes. Since there are no large non-English domain-specific data sets available, although needed for an E2E model, we tackle the lack of such data by combining Natural Language Generation (NLG) and text-to-speech (TTS) to generate French training data. The trained models were evaluated on voice commands acquired in a real smart home with several speakers. Results show that the E2E approach can reach performances similar to a state-of-the art pipeline SLU despite a higher WER than the pipeline approach. Furthermore, the E2E model can benefit from artificially generated data to exhibit lower Concept Error Rates than the pipeline baseline for slot recognition.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132155924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Monotonic Recurrent Neural Network Transducer and Decoding Strategies","authors":"Anshuman Tripathi, Han Lu, H. Sak, H. Soltau","doi":"10.1109/ASRU46091.2019.9003822","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003822","url":null,"abstract":"Recurrent Neural Network Transducer (RNNT) is an end-to-end model which transduces discrete input sequences to output sequences by learning alignments between the sequences. In speech recognition tasks we generally have a strictly monotonic alignment between time frames and label sequence. However, the standard RNNT loss does not enforce this constraint. This can cause some anomalies in alignments such as the model outputting a sequence of labels at a single time frame. There is also no bound on the decoding time steps. To address these problems, we introduce a monotonic version of the RNNT loss. Under the assumption that the output sequence is not longer than the input sequence, this loss can be used with forward-backward algorithm to learn strictly monotonic alignments between the sequences. We present experimental studies showing that speech recognition accuracy for monotonic RNNT is equivalent to standard RNNT. We also explore best-first and breadth-first decoding strategies for both monotonic and standard RNNT models. Our experiments show that breadth-first search is effective in exploring and combining alternative alignments. Additionally, it also allows batching of hypotheses during search label expansion, allowing better resource utilization, and resulting in decoding speedup.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132430047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Very Deep Convolutional Neural Networks to Automatically Detect Plagiarized Spoken Responses","authors":"Xinhao Wang, Keelan Evanini, Yao Qian, K. Zechner","doi":"10.1109/ASRU46091.2019.9003924","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003924","url":null,"abstract":"This study focuses on the automatic plagiarism detection in the context of high-stakes spoken language proficiency assessment, in which some test takers may attempt to game the test by memorizing prepared source materials before the test and then adapting them on-the-fly during the test to produce their spoken responses. When trying to identify such instances of plagiarism, experienced human raters attempt to find salient matching expressions that appear both in potential source materials and the test responses. This motivates an approach that visualizes a grid of lexical matches between a test response and a source and then applies state-of-the-art image recognition techniques to detect patterns of matching sequences. This study employs Inception networks-very deep convolutional neural networks-to build automatic detection models. The system achieves an F1-score of 79.6% on the class of plagiarized responses outperforming a baseline system based on word sequence matching (F1-score of 74.1%).","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125534135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Controlling False Alarm - Miss Trade-Off in Perceptual Speaker Comparison via Non-Neutral Listening Task Framing","authors":"Rosa González Hautamäki, T. Kinnunen","doi":"10.1109/ASRU46091.2019.9003978","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003978","url":null,"abstract":"Speaker comparison by listening is a valuable resource, for instance, in human voice discrimination studies, and voice conversion (VC) systems evaluations. Usually, listeners are provided with application-neutral guidelines that encourage retaining overall high speaker discrimination accuracy. Nonetheless, listeners are subject to misses (declaring same-speaker trial as different-speaker) and false alarms (vice versa) with possibly non-symmetric outcomes. In automatic speaker verification (ASV) applications, the consequences of a miss and a false alarm are rarely equal, and decision making policy is adjusted towards a given application with a desired miss/false alarm trade-off. We study whether listener decisions could similarly be controlled to provoke more accept (or reject) decisions, by framing the voice comparison task in different ways. Our neutral, forensic, user-convenient bank and secure bank scenarios are played by disjoint panels (through Amazon's Mechanical Turk), all judging the same speaker trials originated from RedDots and 2018 Voice Conversion Challenge (VCC 2018) data. Our results indicate that listener decisions can be influenced by modifying the task framing. As a subjective task, the challenge is how to drive the panel decisions to the desired direction (to reduce miss or false alarm rate). Our preliminary results suggest potential for novel, application-directed speaker discrimination designs.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115035978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatio-Temporal Context Modelling for Speech Emotion Classification","authors":"Md. Asif Jalal, Roger K. Moore, Thomas Hain","doi":"10.1109/ASRU46091.2019.9004037","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004037","url":null,"abstract":"Speech emotion recognition (SER) is a requisite for emotional intelligence that affects the understanding of speech. One of the most crucial tasks is to obtain patterns having a maximum correlation for the emotion classification task from the speech signal while being invariant to the changes in frequency, time and other external distortions. Therefore, learning emotional contextual feature representation independent of speaker and environment is essential. In this paper, a novel spatiotemporal context modelling framework for robust SER is proposed to learn feature representation by using acoustic context expansion with high dimensional feature projection. The framework uses a deep convolutional neural network (CNN) and self-attention network. The CNNs combine spatiotemporal features. The attention network produces high dimensional task-specific features and combines these features for context modelling, which altogether provides a state-of-the-art technique for classifying the extracted patterns for speech emotion. Speech emotion is a categorical perception representing discrete sensory events. The proposed approach is compared with a wide range of architectures on the RAVDESS and IEMOCAP corpora for 8-class and 4-class emotion classification tasks and remarkable gain over state-of-the-art systems are obtained, absolutely 15%, 10% respectively.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"192 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116783173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparative Study on End-to-End Speech to Text Translation","authors":"Parnia Bahar, Tobias Bieschke, H. Ney","doi":"10.1109/ASRU46091.2019.9003774","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003774","url":null,"abstract":"Recent advances in deep learning show that end-to-end speech to text translation model is a promising approach to direct the speech translation field. In this work, we provide an overview of different end-to-end architectures, as well as the usage of an auxiliary connectionist temporal classification (CTC) loss for better convergence. We also investigate on pre-training variants such as initializing different components of a model using pretrained models, and their impact on the final performance, which gives boosts up to 4% in Bleu and 5% in Ter. Our experiments are performed on 270h IWSLT TED-talks En→De, and 100h LibriSpeech Audio-books En→Fr. We also show improvements over the current end-to-end state-of-the-art systems on both tasks.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116461651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recurrent Neural Network Transducer for Audio-Visual Speech Recognition","authors":"Takaki Makino, H. Liao, Yannis Assael, Brendan Shillingford, Basi García, Otavio Braga, O. Siohan","doi":"10.1109/ASRU46091.2019.9004036","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004036","url":null,"abstract":"This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from YouTube public videos, leading to 31k hours of audio-visual training content. The performance of an audio-only, visual-only, and audio-visual system are compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state-of-the-art on the LRS3-TED set.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"186 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123047643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparison of End-to-End Models for Long-Form Speech Recognition","authors":"C. Chiu, Wei Han, Yu Zhang, Ruoming Pang, S. Kishchenko, Patrick Nguyen, A. Narayanan, H. Liao, Shuyuan Zhang, Anjuli Kannan, Rohit Prabhavalkar, Z. Chen, Tara N. Sainath, Yonghui Wu","doi":"10.1109/ASRU46091.2019.9003854","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003854","url":null,"abstract":"End-to-end automatic speech recognition (ASR) models, including both attention-based models and the recurrent neural network transducer (RNN-T), have shown superior performance compared to conventional systems [1], [2]. However, previous studies have focused primarily on short utterances that typically last for just a few seconds or, at most, a few tens of seconds. Whether such architectures are practical on long utterances that last from minutes to hours remains an open question. In this paper, we both investigate and improve the performance of end-to-end models on long-form transcription. We first present an empirical comparison of different end-to-end models on a real world long-form task and demonstrate that the RNN-T model is much more robust than attention-based systems in this regime. We next explore two improvements to attention-based systems that significantly improve its performance: restricting the attention to be monotonic, and applying a novel decoding algorithm that breaks long utterances into shorter overlapping segments. Combining these two improvements, we show that attention-based end-to-end models can be very competitive to RNN-T on long-form speech recognition.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"187 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114195088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-Resource Domain Adaptation for Speaker Recognition Using Cycle-Gans","authors":"P. S. Nidadavolu, Saurabh Kataria, J. Villalba, N. Dehak","doi":"10.1109/ASRU46091.2019.9003748","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003748","url":null,"abstract":"Current speaker recognition technology provides great performance with the x-vector approach. However, performance decreases when the evaluation domain is different from the training domain, an issue usually addressed with domain adaptation approaches. Recently, unsupervised domain adaptation using cycle-consistent Generative Adversarial Networks (CycleGAN) has received a lot of attention. Cycle-GAN learn mappings between features of two domains given non-parallel data. We investigate their effectiveness in low resource scenario i.e. when limited amount of target domain data is available for adaptation, a case unexplored in previous works. We experiment with two adaptation tasks: microphone to telephone and a novel reverberant to clean adaptation with the end goal of improving speaker recognition performance. Number of speakers present in source and target domains are 7000 and 191 respectively. By adding noise to the target domain during CycleGAN training, we were able to achieve better performance compared to the adaptation system whose CycleGAN was trained on a larger target data. On reverberant to clean adaptation task, our models improved EER by 18.3% relative on VOiCES dataset compared to a system trained on clean data. They also slightly improved over the state-of-the-art Weighted Prediction Error (WPE) de-reverberation algorithm.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131165040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recognizing Long-Form Speech Using Streaming End-to-End Models","authors":"A. Narayanan, Rohit Prabhavalkar, C. Chiu, David Rybach, Tara N. Sainath, Trevor Strohman","doi":"10.1109/ASRU46091.2019.9003913","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003913","url":null,"abstract":"All-neural end-to-end (E2E) automatic speech recognition (ASR) systems that use a single neural network to transduce audio to word sequences have been shown to achieve state-of-the-art results on several tasks. In this work, we examine the ability of E2E models to generalize to unseen domains, where we find that models trained on short utterances fail to generalize to long-form speech. We propose two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances. On a synthesized long-form test set, adding data diversity improves word error rate (WER) by 90% relative, while simulating long-form training improves it by 67% relative, though the combination doesn't improve over data diversity alone. On a real long-form call-center test set, adding data diversity improves WER by 40% relative. Simulating long-form training on top of data diversity improves performance by an additional 27% relative.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125361187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}