SLU for Voice Command in Smart Home: Comparison of Pipeline and End-to-End Approaches

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Pub Date: 2019-12-01. DOI: 10.1109/ASRU46091.2019.9003891

Thierry Desot, François Portet, Michel Vacher

Citations: 13

Abstract

Spoken Language Understanding (SLU) is typically performed as a pipeline of automatic speech recognition (ASR) followed by natural language understanding (NLU). However, errors at the ASR stage degrade NLU performance. Hence, there is rising interest in End-to-End (E2E) SLU, which performs ASR and NLU jointly. Although E2E models have outperformed modular approaches in many NLP tasks, current E2E SLU models have not yet definitively superseded pipeline approaches. In this paper, we compare the pipeline and E2E approaches on the task of voice command in smart homes. An E2E model needs a large domain-specific data set, and none is available for languages other than English, so we address this lack of data by combining Natural Language Generation (NLG) and text-to-speech (TTS) to generate French training data. The trained models were evaluated on voice commands recorded in a real smart home with several speakers. Results show that the E2E approach can reach performance similar to that of a state-of-the-art pipeline SLU system, despite a higher word error rate (WER) than the pipeline approach. Furthermore, the E2E model can benefit from artificially generated data and achieve lower Concept Error Rates than the pipeline baseline for slot recognition.
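The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes that both the word error rate and the Concept Error Rate are computed as an edit distance between a reference sequence and a hypothesis sequence, divided by the reference length. This is a common convention, but the abstract does not state the scoring procedure actually used. The French command and the concept labels (`intent=set_device`, `location=kitchen`, and so on) are hypothetical and are not taken from the paper's data.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance counting substitutions, deletions and insertions."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j items of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Word error rate: the hypothetical ASR output drops one word out of six.
ref_words = "allume la lumière de la cuisine".split()
hyp_words = "allume la lumière de cuisine".split()
wer = error_rate(ref_words, hyp_words)

# Concept error rate: one of four hypothetical labels is substituted.
ref_concepts = ["intent=set_device", "action=turn_on",
                "device=light", "location=kitchen"]
hyp_concepts = ["intent=set_device", "action=turn_on",
                "device=light", "location=bedroom"]
cer = error_rate(ref_concepts, hyp_concepts)

print(f"WER = {wer:.3f}, CER = {cer:.3f}")
```

Under this assumption the two rates are computed over different sequences, so a transcription error only raises the Concept Error Rate when it changes a predicted intent or slot. This is consistent with the reported result that a model with a higher WER can still match or improve on the pipeline for slot recognition.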
