A Unified Endpointer Using Multitask and Multidomain Training

Shuo-yiin Chang, Bo Li, Gabor Simko
DOI: 10.1109/ASRU46091.2019.9003787
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
Citations: 7

Abstract

In speech recognition systems, we generally differentiate the role of endpointers between long-form speech and voice queries, where they are responsible for speech detection and query endpoint detection, respectively. Detection of speech is useful for segmentation and pre-filtering in long-form speech processing. Query endpoint detection, on the other hand, predicts when to stop listening and send the audio received so far for downstream actions; it thus determines system latency and is an essential component of interactive voice systems. For both tasks, the endpointer needs to be robust in challenging environments, including noisy conditions, reverberant environments, and environments with background speech, and it has to generalize well to domains with different speaking styles and rhythms. This work investigates building a unified endpointer by folding the separate speech detection and query endpoint detection tasks into a single neural network model through multitask learning. A categorical domain representation is further incorporated into the model to encourage learning of domain-specific information. The final unified model achieves around a 100 ms (18% relative) latency improvement for near-field voice queries and 150 ms (21% relative) for far-field voice queries over simply pooling all the data together, and a 7% relative frame error rate reduction for long-form speech compared to a standalone speech detection model. The proposed approach also shows good robustness to noisy environments and yields a 180 ms latency improvement on voice queries from an unseen domain.
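The abstract describes one network serving both tasks, with a categorical domain input steering the shared representation. A minimal sketch of how such a multitask endpointer might be wired, where a shared encoder consumes acoustic features concatenated with a one-hot domain vector and two task heads emit per-frame posteriors — all layer sizes, domain labels, and names here are illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_DOMAINS = 3   # e.g. near-field queries, far-field queries, long-form (assumed)
FEAT_DIM = 40     # e.g. per-frame log-mel filterbank features (assumed)
HIDDEN = 64

# Shared encoder weights; a single dense layer stands in for the
# recurrent stack a production endpointer would use.
W_enc = rng.normal(0, 0.1, (FEAT_DIM + NUM_DOMAINS, HIDDEN))
# Task-specific output heads trained jointly (multitask learning).
w_speech = rng.normal(0, 0.1, (HIDDEN, 1))    # speech / non-speech detection
w_endpoint = rng.normal(0, 0.1, (HIDDEN, 1))  # query endpoint detection

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unified_endpointer(frames, domain_id):
    """Per-frame posteriors for both tasks from one shared model."""
    # Categorical domain representation appended to every frame.
    onehot = np.zeros((frames.shape[0], NUM_DOMAINS))
    onehot[:, domain_id] = 1.0
    x = np.concatenate([frames, onehot], axis=1)  # (T, FEAT_DIM + NUM_DOMAINS)
    h = np.tanh(x @ W_enc)                        # shared representation
    return sigmoid(h @ w_speech).ravel(), sigmoid(h @ w_endpoint).ravel()

# 50 frames of dummy features from the (hypothetical) far-field domain.
speech_prob, endpoint_prob = unified_endpointer(
    rng.normal(size=(50, FEAT_DIM)), domain_id=1)
```

At inference, a long-form system would threshold `speech_prob` for segmentation, while an interactive system would close the microphone once `endpoint_prob` crosses its threshold; sharing the encoder is what lets each task benefit from the other's data.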